Note: To load the dataset we decided to use PySpark, and then Pandas to answer the research questions.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
# spark is an existing SparkSession
spark
SparkSession - in-memory
# import libraries
from scipy import stats
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, DoubleType, LongType
import random
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns
import warnings
import json
from datetime import datetime
from statistics import mode
import functions as func
# Load the data into DataFrames
df_lighter_books = spark.read.json("lighter_books.json")
df_lighter_authors = spark.read.json("lighter_authors.json")
df_list = spark.read.json("list.json")
# Views of the files
df_lighter_books.createOrReplaceTempView("lighter_books")
df_lighter_authors.createOrReplaceTempView("lighter_authors")
# Choosing a style for our plots
plt.style.use('ggplot')
plt.rcParams['font.family'] = 'serif'
our_colors = ["#1f487e","#1d3461", "#376996", "#6290c8", "#829cbc"]
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=our_colors) # custom color cycle for our plots (used e.g. by the pie chart)
We have two main files in which our data are saved, "lighter_authors.json" and "lighter_books.json".
"lighter_books.json" has 26 columns and a total of 7027431 rows, while "lighter_authors.json" has 12 columns and 351767 rows.
It is also useful to print the list of all the columns of the datasets.
Usually the data cleaning process is done during the exploratory analysis; however, given the size of the dataset, we decided to clean the columns of interest only when needed.
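The blank-and-null filtering pattern used repeatedly below (`filter(trim(col(c)) != "")` followed by `dropna`) can be wrapped in a small helper. This is only a sketch, shown with a pandas analogue so it runs without a Spark session; the function name is ours, not part of the notebook.

```python
import pandas as pd

def drop_blank_rows(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Drop rows where any of the given columns is null or only whitespace.

    Pandas analogue of the PySpark pattern:
    df.filter(trim(col(c)) != "").dropna(subset=columns)
    """
    out = df.dropna(subset=columns)
    for c in columns:
        out = out[out[c].astype(str).str.strip() != ""]
    return out.reset_index(drop=True)

# Toy example shaped like the books data
books = pd.DataFrame({
    "format": ["Paperback", "  ", None, "ebook"],
    "publisher": ["Scholastic", "X", "Y", ""],
})
clean = drop_blank_rows(books, ["format", "publisher"])
print(clean)  # only the Paperback/Scholastic row survives
```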
# Dropping duplicates
df_lighter_books = df_lighter_books.dropDuplicates()
df_lighter_authors = df_lighter_authors.dropDuplicates()
# Let's see the head of Books
books_5 = spark.sql("SELECT * FROM lighter_books LIMIT 5").toPandas() # Transforming the PySpark query result into a pandas dataframe
display(books_5)
print("The total number of columns is:", len(df_lighter_books.columns))
func.count_columns_type(df_lighter_books)
count_books = spark.sql("SELECT COUNT(*) FROM lighter_books").toPandas()
print("The number of rows in the books file is:" , count_books.iloc[0,0])
| asin | author_id | author_name | authors | average_rating | description | edition_information | format | id | image_url | ... | publisher | rating_dist | ratings_count | series_id | series_name | series_position | shelves | text_reviews_count | title | work_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077326 | J.K. Rowling | [(1077326, J.K. Rowling, ), (2927, Mary GrandP... | 4.50 | There is a door at the end of a silent corrido... | US Edition | Paperback | 2 | https://i.gr-assets.com/images/S/compressed.ph... | ... | Scholastic Inc. | 5:1674064|4:664833|3:231195|2:41699|1:16215|to... | 2628006 | 45175 | Harry Potter | 5 | [(324191, to-read), (52675, currently-reading)... | 44716 | Harry Potter and the Order of the Phoenix (Har... | 2809203 | |
| 1 | 1077326 | J.K. Rowling | [(1077326, J.K. Rowling, ), (2927, Mary GrandP... | 4.48 | Harry Potter's life is miserable. His parents ... | Library Edition | Hardcover | 3 | https://i.gr-assets.com/images/S/compressed.ph... | ... | Scholastic Inc | 5:4801606|4:1681521|3:623286|2:145898|1:125040... | 7377351 | 45175 | Harry Potter | 1 | [(63540, fantasy), (57260, favorites), (20889,... | 116930 | Harry Potter and the Sorcerer's Stone (Harry P... | 4640799 | |
| 2 | 1077326 | J.K. Rowling | None | 4.43 | The Dursleys were so mean and hideous that sum... | Hardcover | 4 | https://i.gr-assets.com/images/S/compressed.ph... | ... | Scholastic | 5:1690166|4:781011|3:313727|2:54687|1:15453|to... | 2855044 | 45175 | Harry Potter | 2 | [(282341, to-read), (67110, currently-reading)... | 55286 | Harry Potter and the Chamber of Secrets (Harry... | 6231171 | ||
| 3 | 1077326 | J.K. Rowling | [(1077326, J.K. Rowling, ), (2927, Mary GrandP... | 4.57 | For twelve long years, the dread fortress of A... | Mass Market Paperback | 5 | https://i.gr-assets.com/images/S/compressed.ph... | ... | Scholastic Inc. | 5:1994597|4:696545|3:212678|2:28915|1:13959|to... | 2946694 | 45175 | Harry Potter | 3 | [(292815, to-read), (52580, fantasy), (49479, ... | 58023 | Harry Potter and the Prisoner of Azkaban (Harr... | 2402163 | ||
| 4 | 1077326 | J.K. Rowling | [(1077326, J.K. Rowling, ), (2927, Mary GrandP... | 4.56 | Harry Potter is midway through his training as... | First Scholastic Trade Paperback Edition | Paperback | 6 | https://i.gr-assets.com/images/S/compressed.ph... | ... | Scholastic | 5:1808039|4:663849|3:193604|2:27759|1:12425|to... | 2705676 | 45175 | Harry Potter | 4 | [(287086, to-read), (60111, currently-reading)... | 48637 | Harry Potter and the Goblet of Fire (Harry Pot... | 3046572 |
5 rows × 26 columns
The total number of columns is: 26
The number of categorical columns is: 17
The number of numeric columns is: 7
The number of other types of columns is: 2
The number of rows in the books file is: 7027431
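`func.count_columns_type` comes from our helper module `functions` and its source is not shown in the notebook. A minimal sketch consistent with the counts above could classify columns by their Spark dtype strings; the exact grouping of types is an assumption.

```python
def count_columns_type(dtypes):
    """Classify (column, dtype) pairs into categorical, numeric, and other.

    `dtypes` has the shape of PySpark's DataFrame.dtypes:
    a list of (column_name, type_string) tuples.
    """
    numeric = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}
    counts = {"categorical": 0, "numeric": 0, "other": 0}
    for _, t in dtypes:
        if t == "string":
            counts["categorical"] += 1
        elif t in numeric:
            counts["numeric"] += 1
        else:
            counts["other"] += 1
    return counts

# Example shaped like a slice of df_lighter_books.dtypes
example = [("title", "string"), ("ratings_count", "bigint"), ("shelves", "array<struct<...>>")]
print(count_columns_type(example))  # {'categorical': 1, 'numeric': 1, 'other': 1}
```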
# Head of authors data
authors_5 = spark.sql("SELECT (*) FROM lighter_authors LIMIT 5").toPandas()
display(authors_5)
print("The total number of columns is:", len(df_lighter_authors.columns))
func.count_columns_type(df_lighter_authors)
count_authors = spark.sql("SELECT COUNT(*) FROM lighter_authors").toPandas()
print("The number of rows in the authors file is:" , count_authors.iloc[0,0])
| about | average_rating | book_ids | fans_count | gender | id | image_url | name | ratings_count | text_reviews_count | work_ids | works_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Douglas Noël Adams was an English author, comi... | 4.19 | [386162, 13, 8695, 8694, 6091075, 365, 569429,... | 19826 | male | 4 | https://images.gr-assets.com/authors/159137433... | Douglas Adams | 2862064 | 62681 | [3078186, 135328, 1877624, 74123, 3078120, 104... | 106 |
| 1 | William McGuire "Bill" Bryson, OBE, FRS was bo... | 4.02 | [9791, 21, 28, 24, 7507825, 27, 10538, 25, 26,... | 16144 | male | 7 | https://images.gr-assets.com/authors/157859752... | Bill Bryson | 1417316 | 84176 | [613469, 2305997, 940892, 2611786, 7800569, 31... | 75 |
| 2 | Jude Fisher is the pseudonym for <a href="http... | 4.53 | [349254, 15222, 14833682, 15221, 18126815, 152... | 60 | female | 10 | https://images.gr-assets.com/authors/141145711... | Jude Fisher | 56159 | 352 | [17150, 808427, 20487307, 90550, 25460625, 171... | 14 |
| 3 | James Hamilton-Paterson's work has been transl... | 3.79 | [40, 9416484, 12482, 3753106, 26889789, 104764... | 72 | male | 12 | https://images.gr-assets.com/authors/127051738... | James Hamilton-Paterson | 3302 | 480 | [4417, 14300808, 14780, 3796968, 44703121, 103... | 45 |
| 4 | Mark Andrew Watson (born 13 February 1980) is ... | 3.60 | [8466327, 15739968, 22756778, 51026133, 260451... | 179 | male | 14 | https://images.gr-assets.com/authors/133175379... | Mark Watson | 7979 | 772 | [13330815, 19109351, 42306244, 72694240, 26291... | 61 |
The total number of columns is: 12
The number of categorical columns is: 4
The number of numeric columns is: 6
The number of other types of columns is: 2
The number of rows in the authors file is: 351767
# Let's see the columns of the datasets
df_lighter_books.columns
['asin', 'author_id', 'author_name', 'authors', 'average_rating', 'description', 'edition_information', 'format', 'id', 'image_url', 'isbn', 'isbn13', 'language', 'num_pages', 'original_publication_date', 'publication_date', 'publisher', 'rating_dist', 'ratings_count', 'series_id', 'series_name', 'series_position', 'shelves', 'text_reviews_count', 'title', 'work_id']
df_lighter_authors.columns
['about', 'average_rating', 'book_ids', 'fans_count', 'gender', 'id', 'image_url', 'name', 'ratings_count', 'text_reviews_count', 'work_ids', 'works_count']
We can continue to explore our dataset in different ways depending on the type of each variable.
We decided to show frequency tables for book format and publisher. The most commonly used formats are Paperback, Hardcover, ebook, Audio CD, and Kindle Edition, while the most frequent publishers are Book on Demand Ltd., Createspace Independent Publishing Platform, Smashwords Edition, Turtleback Books, and Books LLC, Wiki Series.
# Descriptive statistics for numeric variables of books
statistics_num_var_book = df_lighter_books.select("average_rating", "edition_information", "ratings_count", "text_reviews_count").describe().toPandas()
statistics_num_var_book #1 minute to run
| summary | average_rating | edition_information | ratings_count | text_reviews_count | |
|---|---|---|---|---|---|
| 0 | count | 7027431 | 7027431 | 7027431 | 7027431 |
| 1 | mean | 3.306187915612408 | 4.2425113324187964E11 | 19458.609666035853 | 641.7257807298286 |
| 2 | stddev | 1.4357340879502622 | 1.8024371813681375E13 | 160900.7621571502 | 4169.806199983704 |
| 3 | min | 0.0 | 0 | 0 | |
| 4 | max | 5.0 | 1 | 7400639 | 174524 |
# Descriptive statistics for numeric variables of authors
statistics_num_var_authors = df_lighter_authors.select("average_rating", "fans_count", "ratings_count", "works_count").describe().toPandas()
statistics_num_var_authors #2 sec to run
| summary | average_rating | fans_count | ratings_count | works_count | |
|---|---|---|---|---|---|
| 0 | count | 351767 | 351767 | 351767 | 351767 |
| 1 | mean | 3.651194313281242 | 111.61573143586521 | 4770.586308550831 | 25.937137366495435 |
| 2 | stddev | 1.2114816423862007 | 2661.018138865955 | 97693.95378788888 | 3066.0828516236857 |
| 3 | min | -31.0 | -33 | -41 | 0 |
| 4 | max | 5.0 | 766035 | 27003752 | 1775176 |
# Cleaning the columns
df_lighter_books = df_lighter_books.filter(trim(col("format")) != "")
df_lighter_books = df_lighter_books.filter(trim(col("publisher")) != "")
df_lighter_books = df_lighter_books.dropna(subset=["format","publisher"])
df_lighter_books.createOrReplaceTempView("lighter_books") # needed to save the changes in the view
# For categorical variables we can make frequency tables, for example about the format of the books
formats = spark.sql("SELECT format, COUNT(*) FROM lighter_books GROUP BY format ORDER BY COUNT(*) DESC").toPandas()
# The same can be done for the publisher
publishers = spark.sql("SELECT publisher, COUNT(*) FROM lighter_books GROUP BY publisher ORDER BY COUNT(*) DESC").toPandas()
#2 minutes to run
# This is our frequency table to see the count of the most used formats
formats.head(5)
| format | count(1) | |
|---|---|---|
| 0 | Paperback | 2177540 |
| 1 | Hardcover | 952597 |
| 2 | ebook | 868525 |
| 3 | Audio CD | 117705 |
| 4 | Kindle Edition | 109497 |
# The same for the most popular publishers
publishers.head(5)
| publisher | count(1) | |
|---|---|---|
| 0 | Book on Demand Ltd. | 108860 |
| 1 | Createspace Independent Publishing Platform | 89302 |
| 2 | Smashwords Edition | 79087 |
| 3 | Turtleback Books | 58453 |
| 4 | Books LLC, Wiki Series | 56899 |
# About the authors we can do a plot to see the distribution of the gender
# Cleaning the column
df_lighter_authors = df_lighter_authors.filter(trim(col("gender")) != "")
df_lighter_authors = df_lighter_authors.dropna(subset=["gender"])
# Creating a new column to fix the same gender but written in different ways
df_lighter_authors = df_lighter_authors.withColumn("gender_standard", when(col("gender").isin("non-binary", "Nonbinary", "Non-binary", "nonbinary"), "Non-binary").otherwise(col("gender")))
df_lighter_authors.createOrReplaceTempView("lighter_authors") # needed again because of the new standardized gender column
gender_authors = spark.sql("SELECT gender_standard, COUNT(*) FROM lighter_authors GROUP BY gender_standard ORDER BY count(*) DESC").toPandas()
gender_authors.head(5)
# Most of the authors are male, then come the female authors, and finally there are a few of other genders
| gender_standard | count(1) | |
|---|---|---|
| 0 | male | 138179 |
| 1 | female | 109878 |
| 2 | Non-binary | 95 |
| 3 | genderqueer | 28 |
| 4 | Agender | 11 |
plt.figure(figsize=(15,5))
plt.bar(gender_authors["gender_standard"].head(5),np.log(gender_authors["count(1)"].head(5))) # logarithmic scale to make the differences visible
plt.title('Gender distribution')
plt.xlabel('Gender')
plt.ylabel('Frequency in log-scale')
plt.show()
# Cleaning a bit the data
df_lighter_books = df_lighter_books.filter(trim(col("author_name")) != "") # removes rows with empty author_name
df_lighter_books = df_lighter_books.dropna(subset=["author_name"])
# Query to count how many times an author appears
freq_authors = df_lighter_books.groupBy("author_name").count().orderBy("count", ascending=False).toPandas() #5 minutes to run
# We see that some entries are not actually authors, like "Anonymous", "Unknown" and "NOT A BOOK", so we eliminate them
freq_authors = freq_authors.drop([1,2,4,5,7])
# Let's visualize our authors data
freq_authors
| author_name | count | |
|---|---|---|
| 0 | Jesse Russell | 90083 |
| 3 | Philip M. Parker | 27415 |
| 6 | Books LLC | 18170 |
| 8 | William Shakespeare | 13680 |
| 9 | Agatha Christie | 9834 |
| ... | ... | ... |
| 237093 | Samantha Cross | 1 |
| 237094 | Ashlie Knapp | 1 |
| 237095 | İrfan Beytaş | 1 |
| 237096 | Judson Reese | 1 |
| 237097 | Петя Божилова | 1 |
237093 rows × 2 columns
There are 237093 rows in the dataset with the author's name and the number of books written; representing them all at once would make the graph unreadable, so we tried two strategies.
The first idea is to recreate something like stratified sampling: for us each stratum of the population consists of the authors who wrote the same number of books, and from each stratum we take only one author to represent it.
In this way we get 1303 authors.
To represent them all, we created the function division_data, which divides the dataset into n parts, and then we made a barplot for each of the n parts.
However, the result is still not very clear: a few authors have written far more than the others, and the counts quickly drop down to 1.
Our second attempt, to get a readable graph without excluding those who wrote less, was to draw a smaller sample from the stratified sample already built.
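The helper `division_data` lives in our `functions` module and its source is not shown. A minimal sketch consistent with how it is used below (splitting a dataframe into n parts for separate barplots) might be:

```python
import math
import pandas as pd

def division_data(df: pd.DataFrame, n: int) -> list:
    """Split a dataframe into n consecutive chunks (the last may be shorter)."""
    size = math.ceil(len(df) / n)
    return [df.iloc[i * size:(i + 1) * size] for i in range(n)]

# Toy frame shaped like freq_authors_sample
sample = pd.DataFrame({"author_name": list("abcdefghij"), "count": range(10, 0, -1)})
parts = division_data(sample, 4)
print([len(p) for p in parts])  # [3, 3, 3, 1]
```

Each chunk can then be passed to its own subplot, as done with `fig, axes = plt.subplots(...)` below.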
freq_authors["count"].value_counts() # There are way too many authors to show in a single plot; one strategy is to
# sample one author for each distinct count value, which gives 1303 bars, so we may need several plots
freq_authors_sample = freq_authors.groupby("count").apply(lambda group: group.sample(1)).reset_index(drop=True).sort_values(by="count", ascending=False)
freq_authors_sample
| author_name | count | |
|---|---|---|
| 1037 | Jesse Russell | 90083 |
| 1036 | Philip M. Parker | 27415 |
| 1035 | Books LLC | 18170 |
| 1034 | William Shakespeare | 13680 |
| 1033 | Agatha Christie | 9834 |
| ... | ... | ... |
| 4 | Dorothy May Emerson | 5 |
| 3 | Dusko Popov | 4 |
| 2 | Geoff Micks | 3 |
| 1 | Wayne J. Street Jr. | 2 |
| 0 | Joey Fehrman | 1 |
1038 rows × 2 columns
warnings.filterwarnings("ignore") # hide the warnings about the ggplot font missing some special characters
divised = func.division_data(freq_authors_sample,4) # we divide the data in 4 parts
fig, axes = plt.subplots(len(divised), 1, figsize=(50, 50)) # we avoid sharey=True because the other graphs would not be visible
for i,data in enumerate(divised):
ax = axes[i]
ax.bar(data["author_name"],data["count"])
ax.tick_params(axis='x', labelrotation=45)
    ax.set_title(f"Number of books for each author, part {i + 1}")
#warnings.filterwarnings("default") #Getting back the warnings
freq_authors_sample_2 = freq_authors_sample.sample(30).sort_values(by="count", ascending=False)
plt.figure(figsize=(25,6))
plt.xticks(rotation = 45, fontsize = 8)
plt.bar(freq_authors_sample_2["author_name"], freq_authors_sample_2["count"])
plt.title("Sample of the number of books for each author")
plt.xlabel("Author Name")
plt.ylabel("Frequency")
plt.show()
# Now the graph is more readable, but we are only showing a sample and not the whole distribution
max_rating_count = spark.sql("SELECT MAX(ratings_count) FROM lighter_books").toPandas()
max_rating_count = max_rating_count["max(ratings_count)"][0]
book_highest_review = spark.sql(f"SELECT title FROM lighter_books WHERE ratings_count == {max_rating_count}").toPandas()
# the f-string incorporates max_rating_count into the query
book_highest_review["title"][0]
'Harry Potter und der Stein der Weisen (Harry Potter,#1)'
df_lighter_books = df_lighter_books.filter(trim(col("average_rating")) != "") # removes rows with empty average rating
df_lighter_books = df_lighter_books.dropna(subset=["average_rating"])
average_score_10_worse = spark.sql("SELECT title, average_rating FROM lighter_books ORDER BY average_rating LIMIT 10").toPandas() #5 minutes to run
average_score_10_worse
| title | average_rating | |
|---|---|---|
| 0 | Intimate Worship | 0.0 |
| 1 | Bible in Spoken Word Cassette Narrated by Char... | 0.0 |
| 2 | סופרים יהודיים - כתבים | 0.0 |
| 3 | Classic Fairy Stories: Traditional Tales: Trad... | 0.0 |
| 4 | Death of Aunt | 0.0 |
| 5 | Hindi New Testament | 0.0 |
| 6 | NOT A BOOK: Ithaca, Tompkins County, New York:... | 0.0 |
| 7 | Costa Rica, pura vida. Volumen II | 0.0 |
| 8 | Faith, Hope, and Luck Participant's Guide, Ses... | 0.0 |
| 9 | American Headway 1 | 0.0 |
average_score_10_best = spark.sql("SELECT title, average_rating FROM lighter_books ORDER BY average_rating DESC LIMIT 10").toPandas()
average_score_10_best
| title | average_rating | |
|---|---|---|
| 0 | Встречи | 5.0 |
| 1 | Wind | 5.0 |
| 2 | The Revolutionary Art of the Future: Rediscove... | 5.0 |
| 3 | Out of My Hands | 5.0 |
| 4 | Nirvana Drum Collection | 5.0 |
| 5 | Perfect Italian (Learn Italian with the Michel... | 5.0 |
| 6 | The Long Dive | 5.0 |
| 7 | Los Many Mundos of Raulrsalinas: Un Poetic Jaz... | 5.0 |
| 8 | De Gaulle and the United States: A Centennial ... | 5.0 |
| 9 | Precious Moments: Autumn Adventures | 5.0 |
df_lighter_books = df_lighter_books.filter(trim(col("language")) != "") # removes rows with empty language
df_lighter_books = df_lighter_books.dropna(subset=["language"])
df_languages = df_lighter_books.groupBy("language").count().toPandas()
df_languages = df_languages.sort_values(by="count", ascending=False)
df_languages
# There are 301 different languages!
| language | count | |
|---|---|---|
| 49 | eng | 884148 |
| 91 | ita | 98654 |
| 104 | ger | 91486 |
| 96 | spa | 90072 |
| 0 | fre | 76737 |
| ... | ... | ... |
| 8 | tuk | 1 |
| 166 | bat | 1 |
| 224 | men | 1 |
| 222 | tut | 1 |
| 300 | sam | 1 |
301 rows × 2 columns
English is the predominant language in the dataset. We decided to use a pie chart because it makes it visually clear which language is the most widely used.
We did not put all the languages in the chart because some would occupy slices that are too small; we stopped at the top 30 most widely used languages.
plt.figure(figsize=(10,10))
plt.pie(df_languages["count"].head(30), labels= df_languages["language"].head(30), autopct='%1.1f%%', rotatelabels= True, textprops={'size': 8} )
plt.title("Pie chart of languages")
plt.show()
books_more_250 = spark.sql(" SELECT COUNT(*) FROM lighter_books WHERE num_pages > 250").toPandas()
print("There are", books_more_250["count(1)"][0], "books with more than 250 pages")
There are 1668596 books with more than 250 pages
prolific_50_fans = spark.sql("""
SELECT name, fans_count
FROM lighter_authors
WHERE name IN (SELECT author_name
FROM lighter_books
GROUP BY author_name
ORDER BY COUNT(*) DESC)
ORDER BY fans_count DESC
LIMIT 50
""")
prolific_50_fans_df = prolific_50_fans.toPandas()
prolific_50_fans_df.head(5)
| name | fans_count | |
|---|---|---|
| 0 | Stephen King | 766035 |
| 1 | Veronica Roth | 455358 |
| 2 | Colleen Hoover | 356633 |
| 3 | Rick Riordan | 339346 |
| 4 | James Patterson | 317040 |
# Usually a histogram is the right method to represent the distribution of a variable but in this case the fans_count cannot be considered
# a continuous variable and in fact the histogram has "holes" in it.
plt.figure(figsize=(25, 6))
plt.xticks(rotation=45, fontsize=8)
plt.hist(prolific_50_fans_df["fans_count"], bins=30)
plt.xlabel("Fans_Count")
plt.ylabel("Frequency")
plt.title("Histogram of Fans Count")
plt.show()
# Therefore, to get a graph that better represents the distribution of fans for these 50 authors, we can use a barplot
plt.figure(figsize=(25,6))
plt.xticks(rotation = 45, fontsize = 8)
plt.bar(prolific_50_fans_df["name"], prolific_50_fans_df["fans_count"])
plt.ylabel("Frequency")
plt.xlabel("Author Name")
plt.title("Barplot of Fans Count")
plt.show()
For each year we then compute:
- The number of books published that year.
- The total number of pages written that year.
- The most prolific month of that year.
- The longest book written that year.
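The helpers `func.yearly_stats` (PySpark) and `func.gpt_yearly_stats` (pandas) come from our `functions` module and their source is not shown. A pandas sketch consistent with the tuples printed below, with column names matching the cleaned dataframe, might look like this (the function name here is ours):

```python
import pandas as pd

def yearly_stats_pd(df: pd.DataFrame, year: int):
    """Stats for one year on a frame with 'year', 'month', 'num_pages', 'title'.

    Returns (year, number of books, total pages, most prolific month,
    title of the longest book).
    """
    sub = df[df["year"] == year]
    num_books = sub.shape[0]
    total_pages = int(sub["num_pages"].sum())
    prolific_month = int(sub["month"].mode().iloc[0])
    longest_book = sub.loc[sub["num_pages"].idxmax(), "title"]
    return (year, num_books, total_pages, prolific_month, longest_book)

# Toy frame shaped like df_lighter_books after cleaning
demo = pd.DataFrame({
    "year":      [2000, 2000, 2000, 2001],
    "month":     [7, 7, 9, 1],
    "num_pages": [734, 309, 870, 100],
    "title":     ["Goblet", "Stone", "Phoenix", "Other"],
})
print(yearly_stats_pd(demo, 2000))  # (2000, 3, 1913, 7, 'Phoenix')
```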
df_lighter_books = df_lighter_books.filter(trim(col('original_publication_date')) != '') # removes rows with empty original_publication_date
df_lighter_books = df_lighter_books.filter(trim(col('num_pages')) != '') # removes rows with empty num_pages
df_lighter_books = df_lighter_books.filter(trim(col('title')) != '') # removes rows with empty title
df_lighter_books = df_lighter_books.dropna(subset=['original_publication_date', 'num_pages','title']) # removes rows with null values in original_publication_date, num_pages, and title
# Define the regular expression for date
date_regex = r"^(19\d\d|200[0-2]|20(0[3-9]|1\d|2[0-3]))-(0[1-9]|1[012])-(0[1-9]|1[0-9]|2[0-9]|3[01])$"
# Validate the original_publication_date column
df_lighter_books = df_lighter_books.withColumn(
"is_valid_date",
F.when(F.col("original_publication_date").rlike(date_regex), True).otherwise(False)
)
# Filter the DataFrame to keep only rows with valid dates
df_lighter_books = df_lighter_books.filter(F.col("is_valid_date") == True)
# Drop the is_valid_date column
df_lighter_books = df_lighter_books.drop("is_valid_date")
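As a quick sanity check of what date_regex accepts, here is a standalone snippet. Note the day range is not month-aware, so impossible dates like February 31 still pass the regex and are only caught later by the to_date conversion.

```python
import re

# Same pattern as above: years 1900-2023, months 01-12, days 01-31
date_regex = r"^(19\d\d|200[0-2]|20(0[3-9]|1\d|2[0-3]))-(0[1-9]|1[012])-(0[1-9]|1[0-9]|2[0-9]|3[01])$"

assert re.match(date_regex, "1997-06-26")      # in range: accepted
assert not re.match(date_regex, "2024-01-01")  # year outside 1900-2023: rejected
assert not re.match(date_regex, "1997-13-01")  # invalid month: rejected
assert re.match(date_regex, "2001-02-31")      # caveat: day validity is not month-aware
print("date_regex behaves as expected")
```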
# Attempt to convert original_publication_date to a date
df_lighter_books = df_lighter_books.withColumn("converted_date", to_date(col("original_publication_date"), 'yyyy-MM-dd'))
# Filter out rows where the conversion resulted in null
df_lighter_books = df_lighter_books.filter(col("converted_date").isNotNull())
# Drop the intermediate converted_date column
df_lighter_books = df_lighter_books.drop("converted_date")
df_lighter_books = df_lighter_books.withColumn("title", trim(df_lighter_books.title)) # removes leading and trailing white spaces
# Filter for valid num_pages
df_lighter_books = df_lighter_books.filter(col('num_pages') > 0)
# Select the relevant columns
df_lighter_books = df_lighter_books.select('original_publication_date', 'num_pages', 'title')
df_lighter_books.count() # count the number of rows
2390718
df_lighter_books.show(5) # show the first 5 rows
+-------------------------+---------+--------------------+
|original_publication_date|num_pages|               title|
+-------------------------+---------+--------------------+
|               2003-06-21|      870|Harry Potter and ...|
|               1997-06-26|      309|Harry Potter and ...|
|               1998-07-02|      352|Harry Potter and ...|
|               1999-07-08|      435|Harry Potter and ...|
|               2000-07-08|      734|Harry Potter and ...|
+-------------------------+---------+--------------------+
only showing top 5 rows
df_lighter_books = df_lighter_books.withColumn('year', year('original_publication_date')) # add a year column
df_lighter_books = df_lighter_books.withColumn('month', month('original_publication_date')) # add a month column
df_lighter_books.show(5) # show the first 5 rows
+-------------------------+---------+--------------------+----+-----+
|original_publication_date|num_pages|               title|year|month|
+-------------------------+---------+--------------------+----+-----+
|               2003-06-21|      870|Harry Potter and ...|2003|    6|
|               1997-06-26|      309|Harry Potter and ...|1997|    6|
|               1998-07-02|      352|Harry Potter and ...|1998|    7|
|               1999-07-08|      435|Harry Potter and ...|1999|    7|
|               2000-07-08|      734|Harry Potter and ...|2000|    7|
+-------------------------+---------+--------------------+----+-----+
only showing top 5 rows
df_lighter_books.select("year").distinct().count() # count the number of distinct years
124
max_year = df_lighter_books.agg(max("year")).first()[0] # find the maximum year
print(max_year)
2023
min_year = df_lighter_books.agg(min("year")).first()[0] # find the minimum year
print(min_year)
1900
# Get list of unique years
years = df_lighter_books.select('year').toPandas()['year'].unique().tolist()
years.sort() # sort the list so the slices below correspond to the earliest and latest years
data_first_10 = [func.yearly_stats(df_lighter_books,year) for year in years[:10]] # get the yearly stats for the first 10 years
data_last_10 = [func.yearly_stats(df_lighter_books,year) for year in years[-10:]] # get the yearly stats for the last 10 years
data_first_10 # print the yearly stats for the first 10 years
[(1900, 1747, 509501, 1, 'Complete Works of Joseph Conrad'), (1901, 802, 351965, 1, 'NKJV Study Bible'), (1902, 882, 696460, 1, 'Holy Bible: NLT - New Living Translation'), (1903, 733, 183484, 1, 'The Life of William Ewart Gladstone - Vol. I'), (1904, 628, 162193, 1, 'The Life Recovery Bible NLT'), (1905, 1119, 282252, 1, 'Dictionary of the Bible'), (1906, 518, 145632, 1, "Moody's Magazine Vol 1 - 20"), (1907, 558, 137774, 1, 'Arsenio Lupin, Caballero Ladrón'), (1908, 430, 97105, 1, 'Anne of Green Gables--The Complete Collection'), (1909, 1041, 298330, 1, 'The Works of Rudyard Kipling, 10 Vols')]
data_last_10 # print the yearly stats for the last 10 years
[(2014, 177582, 42919290, 1, 'A Most Unlikely Countess (To Love a Wildcat, #2)'), (2015, 53821, 14013926, 1, 'Revel for the American Nation: A History of the United States, Combined Volume -- Access Card'), (2016, 2299, 708061, 1, 'Homestuck'), (2017, 491, 150653, 2, 'The Starfarers Quartet'), (2018, 192, 72498, 1, '地海六部曲'), (2019, 118, 34813, 8, 'Nouvelles Les'), (2020, 83, 27477, 1, 'The Complete Ripley Novels (Ripley, #1-5)'), (2021, 17, 8088, 8, "The Navigator's Children (The Last King of Osten Ard, #3)"), (2022, 4, 1280, 10, 'Highland Ever After (The Montgomerys and Armstrongs, #3)'), (2023, 1, 463, 9, 'Apocalypse')]
data_first_10 = pd.DataFrame(data_first_10, columns =['year', 'num_books', 'total_pages', 'prolific_month', 'longest_book']) # convert the list to a pandas dataframe
data_first_10 # print the dataframe
| year | num_books | total_pages | prolific_month | longest_book | |
|---|---|---|---|---|---|
| 0 | 1900 | 1747 | 509501 | 1 | Complete Works of Joseph Conrad |
| 1 | 1901 | 802 | 351965 | 1 | NKJV Study Bible |
| 2 | 1902 | 882 | 696460 | 1 | Holy Bible: NLT - New Living Translation |
| 3 | 1903 | 733 | 183484 | 1 | The Life of William Ewart Gladstone - Vol. I |
| 4 | 1904 | 628 | 162193 | 1 | The Life Recovery Bible NLT |
| 5 | 1905 | 1119 | 282252 | 1 | Dictionary of the Bible |
| 6 | 1906 | 518 | 145632 | 1 | Moody's Magazine Vol 1 - 20 |
| 7 | 1907 | 558 | 137774 | 1 | Arsenio Lupin, Caballero Ladrón |
| 8 | 1908 | 430 | 97105 | 1 | Anne of Green Gables--The Complete Collection |
| 9 | 1909 | 1041 | 298330 | 1 | The Works of Rudyard Kipling, 10 Vols |
data_last_10 = pd.DataFrame(data_last_10, columns =['year', 'num_books', 'total_pages', 'prolific_month', 'longest_book']) # convert the list to a pandas dataframe
data_last_10 # print the dataframe
| year | num_books | total_pages | prolific_month | longest_book | |
|---|---|---|---|---|---|
| 0 | 2014 | 177582 | 42919290 | 1 | A Most Unlikely Countess (To Love a Wildcat, #2) |
| 1 | 2015 | 53821 | 14013926 | 1 | Revel for the American Nation: A History of th... |
| 2 | 2016 | 2299 | 708061 | 1 | Homestuck |
| 3 | 2017 | 491 | 150653 | 2 | The Starfarers Quartet |
| 4 | 2018 | 192 | 72498 | 1 | 地海六部曲 |
| 5 | 2019 | 118 | 34813 | 8 | Nouvelles Les |
| 6 | 2020 | 83 | 27477 | 1 | The Complete Ripley Novels (Ripley, #1-5) |
| 7 | 2021 | 17 | 8088 | 8 | The Navigator's Children (The Last King of Ost... |
| 8 | 2022 | 4 | 1280 | 10 | Highland Ever After (The Montgomerys and Armst... |
| 9 | 2023 | 1 | 463 | 9 | Apocalypse |
df_lighter_books_pd = df_lighter_books.toPandas() # convert the dataframe to a pandas dataframe
# Get list of unique years
years = df_lighter_books_pd['year'].unique().tolist()
years.sort() # sort the list so the slices below correspond to the earliest and latest years
data_first_10 = [func.gpt_yearly_stats(df_lighter_books_pd,year) for year in years[:10]] # get the yearly stats for the first 10 years
data_last_10 = [func.gpt_yearly_stats(df_lighter_books_pd,year) for year in years[-10:]] # get the yearly stats for the last 10 years
data_first_10 # print the yearly stats for the first 10 years
[(1900, 1747, 509501, 1, 'Complete Works of Joseph Conrad'), (1901, 802, 351965, 1, 'NKJV Study Bible'), (1902, 882, 696460, 1, 'Holy Bible: NLT - New Living Translation'), (1903, 733, 183484, 1, 'The Life of William Ewart Gladstone - Vol. I'), (1904, 628, 162193, 1, 'The Life Recovery Bible NLT'), (1905, 1119, 282252, 1, 'Dictionary of the Bible'), (1906, 518, 145632, 1, "Moody's Magazine Vol 1 - 20"), (1907, 558, 137774, 1, 'Arsenio Lupin, Caballero Ladrón'), (1908, 430, 97105, 1, 'Anne of Green Gables--The Complete Collection'), (1909, 1041, 298330, 1, 'The Works of Rudyard Kipling, 10 Vols')]
data_last_10 # print the yearly stats for the last 10 years
[(2014, 177582, 42919290, 1, 'A Most Unlikely Countess (To Love a Wildcat, #2)'), (2015, 53821, 14013926, 1, 'Revel for the American Nation: A History of the United States, Combined Volume -- Access Card'), (2016, 2299, 708061, 1, 'Homestuck'), (2017, 491, 150653, 2, 'The Starfarers Quartet'), (2018, 192, 72498, 1, '地海六部曲'), (2019, 118, 34813, 8, 'Nouvelles Les'), (2020, 83, 27477, 1, 'The Complete Ripley Novels (Ripley, #1-5)'), (2021, 17, 8088, 8, "The Navigator's Children (The Last King of Osten Ard, #3)"), (2022, 4, 1280, 10, 'Highland Ever After (The Montgomerys and Armstrongs, #3)'), (2023, 1, 463, 9, 'Apocalypse')]
data_first_10 = pd.DataFrame(data_first_10, columns =['year', 'num_books', 'total_pages', 'prolific_month', 'longest_book']) # convert the list to a pandas dataframe
data_first_10 # print the dataframe
| year | num_books | total_pages | prolific_month | longest_book | |
|---|---|---|---|---|---|
| 0 | 1900 | 1747 | 509501 | 1 | Complete Works of Joseph Conrad |
| 1 | 1901 | 802 | 351965 | 1 | NKJV Study Bible |
| 2 | 1902 | 882 | 696460 | 1 | Holy Bible: NLT - New Living Translation |
| 3 | 1903 | 733 | 183484 | 1 | The Life of William Ewart Gladstone - Vol. I |
| 4 | 1904 | 628 | 162193 | 1 | The Life Recovery Bible NLT |
| 5 | 1905 | 1119 | 282252 | 1 | Dictionary of the Bible |
| 6 | 1906 | 518 | 145632 | 1 | Moody's Magazine Vol 1 - 20 |
| 7 | 1907 | 558 | 137774 | 1 | Arsenio Lupin, Caballero Ladrón |
| 8 | 1908 | 430 | 97105 | 1 | Anne of Green Gables--The Complete Collection |
| 9 | 1909 | 1041 | 298330 | 1 | The Works of Rudyard Kipling, 10 Vols |
data_last_10 = pd.DataFrame(data_last_10, columns=['year', 'num_books', 'total_pages', 'prolific_month', 'longest_book']) # convert the list to a pandas dataframe
data_last_10 # print the dataframe
| | year | num_books | total_pages | prolific_month | longest_book |
|---|---|---|---|---|---|
| 0 | 2014 | 177582 | 42919290 | 1 | A Most Unlikely Countess (To Love a Wildcat, #2) |
| 1 | 2015 | 53821 | 14013926 | 1 | Revel for the American Nation: A History of th... |
| 2 | 2016 | 2299 | 708061 | 1 | Homestuck |
| 3 | 2017 | 491 | 150653 | 2 | The Starfarers Quartet |
| 4 | 2018 | 192 | 72498 | 1 | 地海六部曲 |
| 5 | 2019 | 118 | 34813 | 8 | Nouvelles Les |
| 6 | 2020 | 83 | 27477 | 1 | The Complete Ripley Novels (Ripley, #1-5) |
| 7 | 2021 | 17 | 8088 | 8 | The Navigator's Children (The Last King of Ost... |
| 8 | 2022 | 4 | 1280 | 10 | Highland Ever After (The Montgomerys and Armst... |
| 9 | 2023 | 1 | 463 | 9 | Apocalypse |
My implementation and the ChatGPT implementation do the same job but use different libraries. Mine uses PySpark, the Python interface for Apache Spark, which processes and analyzes large amounts of data and can distribute that work across many nodes when they are available. The ChatGPT implementation uses pandas, a popular Python data analysis library that operates in memory on a single node.
Let's compare the steps:
Filter books for a given year: Both implementations are filtering the data for a specific year. PySpark uses the filter() function, while pandas uses boolean indexing.
Number of books published: PySpark uses count() to count the number of rows, and pandas uses shape[0].
Total number of pages: PySpark uses the agg() function with F.sum() to calculate the sum, while pandas uses the sum() function.
Most prolific month: PySpark uses groupBy('month').count().orderBy(F.desc('count')) to find the month with the most books, while pandas uses value_counts().idxmax().
Longest book: PySpark uses orderBy(F.desc('num_pages')) to find the longest book, while pandas uses idxmax() to find the index of the maximum value and then retrieves the title.
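The pandas side of the five steps above can be sketched on a tiny synthetic table (the data and the column names `publication_year`, `month`, and `num_pages` are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Minimal synthetic books table (illustrative data only).
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "publication_year": [2023, 2023, 2023, 2022],
    "month": [9, 9, 3, 1],
    "num_pages": [463, 120, 88, 300],
})

def yearly_stats(df, year):
    year_df = df[df["publication_year"] == year]       # filter: boolean indexing
    num_books = year_df.shape[0]                       # row count
    total_pages = int(year_df["num_pages"].sum())      # page total
    prolific_month = int(year_df["month"].value_counts().idxmax())
    longest_book = year_df.loc[year_df["num_pages"].idxmax(), "title"]
    return num_books, total_pages, prolific_month, longest_book

print(yearly_stats(books, 2023))  # -> (3, 671, 9, 'A')
```

Each line maps one-to-one onto the PySpark calls listed above (`filter`/`count`/`agg`/`groupBy`/`orderBy`), which is why the two implementations agree.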
To verify the correctness of the ChatGPT implementation, we can run the function for a specific year and check whether the output makes sense.
For example, we can run:
print(yearly_stats(2023))
And then verify that:
num_books matches the total number of books published in 2023.
total_pages matches the total number of pages across all books published in 2023.
prolific_month matches the month with the most books published in 2023.
longest_book matches the title of the book with the most pages published in 2023.
All of these checks pass, so we can conclude that the function is working correctly.
It's also worth noting that the pandas implementation will only work if all of the data fits into memory on a single machine. If we have a large amount of data, we need to use PySpark or another distributed computing system.
# Remove whitespace from the 'name' field
df_lighter_authors = df_lighter_authors.withColumn('name', trim(col('name')))
# Drop rows where 'name' is null or empty
df_lighter_authors = df_lighter_authors.filter((col('name') != "") & (col('name').isNotNull()))
# Drop rows where 'id' is null
df_lighter_authors = df_lighter_authors.filter(col('id').isNotNull())
duplicate_authors = df_lighter_authors.groupBy("name").agg(count("id").alias("num_ids")).filter("num_ids > 1") # find the duplicate authors
duplicate_authors.count() # count the number of duplicate authors
37
duplicate_authors.show(50) # show the duplicate authors
+--------------------+-------+
|                name|num_ids|
+--------------------+-------+
|      Julie Campbell|      2|
|      Peter Marshall|      2|
|     Catherine Jones|      2|
|         Joseph Fink|      2|
|      Peter Marshall|      2|
|         Paul Davies|      2|
|   James C.L. Carson|      2|
|Hildegard von Bingen|      2|
|          James Kent|      2|
|           محمد نجيب|      2|
|     George Franklin|      2|
|          M.K. Graff|      2|
|         Martin Shaw|      2|
|        Erin Bedford|      2|
|William Messner-L...|      2|
|         David Yates|      2|
|         Paul Graham|      2|
|        Peter Davies|      2|
|Katherine Mercuri...|      2|
|       Dimitar Dimov|      2|
|        David Nelson|      2|
|        Q. Hayashida|      2|
|          Peter King|      2|
|         Peter Green|      2|
|           John Mole|      2|
|            Mike Lee|      2|
|Christopher Phillips|      2|
|         Chris Lynch|      2|
|     Caroline Miller|      2|
|       Yordan Yovkov|      2|
| Jackson Butch Guice|      2|
|      Boris Zakhoder|      2|
|         Peter Green|      2|
|        Jorge Molina|      2|
|             Cicerón|      2|
|Robert W. Sulliva...|      2|
|          小野不由美|      2|
+--------------------+-------+
The table lists the author names that are shared by more than one author, together with the number of distinct IDs associated with each name. Since several names map to multiple IDs, author names are not unique, and the answer to the question is false.
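As a sanity check, the same duplicate-name query can be reproduced in pandas on a small synthetic authors table (the rows below are illustrative, not the real data):

```python
import pandas as pd

# Synthetic authors table (illustrative rows only).
authors = pd.DataFrame({
    "id":   [1, 2, 3, 4, 5],
    "name": ["Peter Green", "Peter Green", "Joseph Fink", "Joseph Fink", "M.K. Graff"],
})

# groupBy("name").agg(count("id")).filter(...) in PySpark
# becomes groupby / count / query in pandas.
duplicate_authors = (authors.groupby("name")["id"]
                            .count()
                            .reset_index(name="num_ids")
                            .query("num_ids > 1"))
print(duplicate_authors)
```

On the synthetic rows this flags "Peter Green" and "Joseph Fink", each with two IDs, mirroring the PySpark result above.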
df_lighter_books = df_lighter_books.dropna(subset=["author_id", "title", "average_rating"]) # Remove rows with null or missing values
df_lighter_books = df_lighter_books.filter(trim(col('title')) != '') # removes rows with empty title
df_lighter_books = df_lighter_books.withColumn("title", trim(col("title"))) # Trim whitespace from the title column
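For comparison with the PySpark cleaning above, the same three steps in pandas look like this (a sketch on synthetic rows; the column names match those used in the notebook):

```python
import pandas as pd

# Synthetic rows exercising each cleaning rule (illustrative only).
books = pd.DataFrame({
    "author_id": [4, 4, None],
    "title": ["  Last Chance to See ", "   ", "The Salmon of Doubt"],
    "average_rating": [4.1, 3.9, 4.0],
})

books = books.dropna(subset=["author_id", "title", "average_rating"])  # drop nulls
books = books[books["title"].str.strip() != ""]                        # drop empty titles
books["title"] = books["title"].str.strip()                            # trim whitespace

print(books["title"].tolist())  # -> ['Last Chance to See']
```

The whitespace-only title and the row with a null author_id are both removed, and the surviving title is trimmed, matching the behavior of the `dropna`/`filter`/`trim` chain above.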
# List of author IDs
author_ids = [4]
# Use the function
author_books_dict = func.get_author_books(df_lighter_books, author_ids)
# Print the result
print(author_books_dict)
{4: ["The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Ultimate Hitchhiker's Guide: Five Complete Novels and One Story (Hitchhiker's Guide to the Galaxy, #1-5)", "The Ultimate Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1-5)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy: Quandary Phase (Hitchhiker's Guide: Radio Play, #4)", "The Ultimate Hitchhiker's Guide (Hitchhiker's Guide to the Galaxy, #1-5)", "The Hitchhiker's Guide to the Galaxy: Quintessential Phase (Hitchhiker's Guide: Radio Play, #5)", 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', "Dirk Gently's Holistic Detective Agency (Dirk Gently, #1)", 'The Salmon of Doubt (Dirk Gently, #3)', "Mostly Harmless (Hitchhiker's Guide to the Galaxy, #5)", "Life, the Universe and Everything (Hitchhiker's Guide, #3)", 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', "Dirk Gently's Holistic Detective Agency (Dirk Gently, #1)", 'The Salmon of Doubt', 'The Salmon of Doubt: Hitchhiking the Galaxy One Last Time', 'The Salmon of Doubt: Hitchhiking the Galaxy One Last Time (Dirk Gently, #3)', 'Salmon of Doubt', 'The Salmon Of Doubt', 'Salmon of Doubt: Hitchhiking the Galaxy One Last Time', 'The Deeper Meaning of Liff', "Dirk Gently's Holistic Detective Agency (Dirk Gently #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "Life, the Universe and Everything (Hitchhiker's Guide to the Galaxy, #3)", "The Restaurant at the End of the Universe (Hitchhiker's Guide to the Galaxy, #2)", 'Last Chance to See', "So Long, and Thanks for All the Fish (Hitchhiker's Guide to the Galaxy, #4)", "Mostly Brilliant (Hitchhiker's Guide, #1-5)", "Guide to the Hitchhiker's Guide to the Galaxy", "So Long, and Thanks for All the Fish (Hitchhiker's Guide, 
#4)", "So Long, and Thanks for All the Fish (Hitchhiker's Guide, #4)", "The Restaurant at the End of the Universe (The Hitchhiker's Guide to the Galaxy, #2)", "Two Complete Novels: Dirk Gently's Holistic Detective Agency / The Long Dark Tea-time of the Soul (Dirk Gently #1-2)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "Hitch-hiker's Guide to the Galaxy: The Original Radio Scripts", "Mostly Harmless (Hitchhiker's Guide, #5)", "The Hitchhiker's Trilogy (Hitchhiker's Guide, #1-5)", "The Illustrated Hitchhiker's Guide To The Galaxy", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy: The Quintessential Phase (Hitchhiker's Guide: Radio Play, #5)", "The Hitchhiker's Guide to the Galaxy Radio Scripts: Tertiary, Quandary & Quintessential Phases", "The More Than Complete Hitchhiker's Guide (Hitchhiker's Guide, #1-4 + short story)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", 'Last Chance to See', "So Long, and Thanks for All the Fish (Hitchhiker's Guide, #4)", 'Guía del autoestopista galáctico', "The Hitchhiker's Guide to the Galaxy Omnibus (Hitchhiker's Guide, #1-4)", "The Hitchhiker's Guide to the Galaxy Omnibus (Hitchhiker's Guide, #1-4)", "Dirk Gently's Holistic Detective Agency / The Long Dark Tea-time of the Soul (Dirk Gently #1-2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", 'The Meaning of Liff (Meaning of Liff, #1)', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy", "Mostly Harmless (Hitchhiker's Guide, #5)", "The More Than Complete Hitchhiker's Guide", 'Douglas Adams Live in Concert', "The Restaurant at the End of the Universe (Hitchhiker's Guide to the Galaxy #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Restaurant at the End of the Universe 
(Hitchhiker's Guide, #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "Life, the Universe and Everything (Hitchhiker's Guide, #3)", 'La vida, el universo y todo lo demás (Guía del autoestopista galáctico, #3)', 'So Long, and Thanks for All the Fish', "So Long, and Thanks for All the Fish (Hitchhiker's Guide, #4)", "So Long, and Thanks for All the Fish (Hitchhiker's Guide to the Galaxy #4)", "So Long, and Thanks for All the Fish (Hitchhiker's Guide, #4)", 'So long, and thanks for all the fish', 'So Long and Thanks for All the Fish', "Mostly Harmless (Hitchhiker's Guide to the Galaxy #5)", "Mostly Harmless (Hitchhiker's Guide, #5)", "Life, the Universe and Everything (Hitchhiker's Guide, #3)", 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', "Mostly Harmless (Hitchhiker's Guide to the Galaxy, #5)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", 'So Long, and Thanks for All the Fish', 'Douglas Adams Live in Concert', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide To The Galaxy (Hitchhiker's Guide, #1-4)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", 'So long, and thanks for all the fish', "Mostly Harmless (Hitchhiker's Guide, #5)", 'Beau comme un aéroport (Dirk Gently, détective holistique, #2)', 'Un cheval dans la salle de bains', "Life, the Universe and Everything (Hitchhiker's Guide, #3)", 'So Long, and Thanks for All the Fish', "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Hitchhiker's Guide to the Galaxy Omnibus (Hitchhiker's Guide, #1-4)", 'The Deeper Meaning of Liff', 'Per Anhalter durch die Galaxis (Per Anhalter durch die Galaxis, #1-5)', 'The Original Hitchhiker Radio Scripts', 'Die Letzten ihrer Art: Eine Reise zu den aussterbenden Tieren unserer Erde', 'Der lange dunkle Fünfuhrtee der Seele (Dirk Gently, #2)', 'Le Dernier Restaurant 
avant la fin du monde (H2G2 #2)', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "Einmal Rupert und zurück (Hitchhiker's Guide to the Galaxy, #5)", "Macht's gut, und danke für den Fisch (Hitchhiker's Guide to the Galaxy, #4)", 'Das Restaurant am Ende des Universums (Per Anhalter durch die Galaxis, #2)', 'Globalement inoffensive (H2G2 #5)', 'Informe sobre la Tierra: Fundamentalmente inofensiva (Guía del autoestopista galáctico, #5)', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy: The Trilogy of Four (Hitchhiker's Guide, #1-4)", "The Hitchhiker's Guide to the Galaxy Box Set (Hitchhiker's Guide, #1-5)", "The Hitch Hiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy: The Original Radio Scripts", "The Illustrated Hitchhiker's Guide To The Galaxy", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "Mostly Harmless (Hitchhiker's Guide, #5)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "Dirk Gently's Holistic Detective Agency (Dirk Gently #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", 'The Dirk Gently Omnibus (Dirk Gently, #1-2)', 'Dirk Gently. Agencia de investigaciones holísticas', 'Der elektrische Mönch (Dirk Gently #1)', 'The Salmon of Doubt: Hitchhiking the Galaxy One Last Time', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy: The Quandary Phase (Hitchhiker's Guide: Radio Play, #4)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "So Long, and Thanks for All the Fish (Hitchhiker's Guide, #4)", "The Hitch Hiker's Guide to the Galaxy: A Trilogy in Five Parts", "La vie, l'univers et le reste", 'Lachs im Zweifel. 
Zum letzten Mal per Anhalter durch die Galaxis', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "Mostly Harmless (Hitchhiker's Guide, #5)", "Mostly Harmless (Hitchhiker's Guide, #5)", "Dirk Gently's Holistic Detective Agency (Dirk Gently #1)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "Dirk Gently's Holisitic Detective Agency", 'So Long, and Thanks for All the Fish', "Life, the Universe and Everything (Hitchhiker's Guide, #3)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", 'The Salmon of Doubt: Hitchhiking the Galaxy One Last Time', 'Last Chance to See', 'The Salmon Of Doubt: Hitchhiking The Galaxy One Last Time', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", 'Last Chance to See', "Douglas Adams at the BBC: A Celebration of the Author's Life and Work", "Mostly Harmless (Hitchhiker's Guide to the Galaxy, #5)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "So Long, and Thanks for All the Fish (Hitchhiker's Guide, #4)", 'Macht’s gut, und danke für den Fisch (Per Anhalter durch die Galaxis, #4)', 'Salut, et encore merci pour le poisson (H2G2 #4)', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy: The Tertiary Phase (Hitchhiker's Guide: Radio Play, #3)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", 'The Meaning of Liff', "The Hitchhiker's Trilogy", 'The Meaning of Liff', "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "So Long, and Thanks for All the Fish (Hitchhiker's Guide, #4)", 'The Salmon of Doubt: Hitchhiking the Galaxy One Last Time', 'Iras Celestiales: Una investigación de Dirk Gently (Dirk Gently #2)', 'Le Dernier 
Restaurant Avant la Fin du Monde (Le Guide Galactique 2)', 'The Meaning Of Liff', "Life, the Universe and Everything (Hitchhiker's Guide to the Galaxy, #3)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitch Hiker's Guide to the Galaxy: A Trilogy in Four Parts (Hitchhiker's Guide, #1-4)", "So Long, and Thanks for All the Fish (Hitchhiker's Guide #4)", "Life, the Universe and Everything (Hitchhiker's Guide to the Galaxy #3)", "The Hitchhiker's Guide to the Galaxy 5 Book Set (Hitchhiker's Guide, #1-5)", "The Ultimate Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy #1-5)", 'Last Chance to See', 'The Salmon of Doubt: Hitchhiking the Galaxy One Last Time', "Dirk Gently's Holistic Detective Agency (Dirk Gently, #1)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "Life, the Universe and Everything (Hitchhiker's Guide #3)", 'The Long Dark Tea-time of the Soul (Dirk Gently, #2)', "The Hitchhiker's Guide to the Galaxy: The Tertiary Phase (Hitchhiker's Guide: Radio Play, #3)", "The Hitchhiker Trilogy (Hitchhiker's Guide, #1-5)", "The Hitchhiker's Guide to the Galaxy: Quintessential Phase", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy: Complete Trilogy in Five Volumes (Hitchhiker's Guide, #1-5)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "Life, the Universe and Everything (Hitchhiker's Guide to the Galaxy, #3)", "So Long, and Thanks for All the Fish (Hitchhiker's Guide, #4)", 'So Long, and Thanks for All the Fish', "Mostly Harmless (Hitchhiker's Guide, #5)", "Mostly Harmless 
(Hitchhiker's Guide, #5)", "The Hitchhiker's Guide to the Galaxy: Tertiary Phase (Hitchhiker's Guide: Radio Play, #3)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy: A Trilogy in Four Parts", 'Das Leben, das Universum und der ganze Rest (Per Anhalter durch die Galaxis, #3)', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Hitchhiker's Guide to the Galaxy Boxed Set (Hitchhiker's Guide, #1-3)", "The Restaurant at the End of the Universe (Hitchhiker's Guide to the Galaxy, #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "Dirk Gently's Holistic Detective Agency / The Long Dark Tea-time of the Soul (Dirk Gently, #1-2)", 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', "Dirk Gently's Holistic Detective Agency / The Long Dark Tea-time of the Soul (Dirk Gently #1-2)", 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', "The Hitchhiker's Guide to the Galaxy Omnibus (Hitchhiker's Guide, #1-4)", "Dirk Gently's Holistic Detective Agency", "Dirk Gently's Holistic Detective Agency (Dirk Gently #1)", "Dirk Gently's Holistic Detective Agency (Dirk Gently #1)", "Dirk Gently's Holistic Detective Agency (Dirk Gently #1)", "Dirk Gently's Holistic 
Detective Agency (Dirk Gently #1)", 'Linnunrata: Viisiosainen trilogia', 'Last Chance to See', 'Last Chance to See', 'Douglas Adams Live in Concert', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitch Hiker's Guide to the Galaxy: Complete Set. Complete & Unabridged", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy: The Quandary Phase (Hitchhiker's Guide: Radio Play, #4)", "The Hitchhiker's Guide To The Galaxy", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", "The Hitchhiker's Guide To The Galaxy The Tertiary Phase", 'The Salmon of Doubt', "Dirk Gently's Holistic Detective Agency (Dirk Gently #1)", "Mostly Harmless (Hitchhiker's Guide, #5)", "So Long, and Thanks for All the Fish (Hitchhiker's Guide, #4)", "So Long, and Thanks for All the Fish (Hitchhiker's Guide to the Galaxy #4)", 'So Long, and Thanks for All the Fish', "Life, the Universe and Everything (Hitchhiker's Guide, #3)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "Dirk Gently's Holistic Detective Agency", 'Автостопом по Галактике. Том 1. 
Путеводитель туда', "Losos' somnenij", 'Salmon of Doubt (Dirk Gently, #3)', 'The Original Hitchhiker Radio Scripts', 'The Utterly Utterly Merry Comic Relief Christmas Book', 'Not 1982', "Life, the Universe and Everything (Hitchhiker's Guide, #3)", "The Ultimate Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1-5 + short story)", "Life, the Universe and Everything (The Hitchhiker's Guide to the Galaxy, #3)", 'Le Guide du voyageur galactique', 'The Salmon Of Doubt: Hitchhiking The Galaxy One Last Time', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", 'The Deeper Meaning of Liff', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', 'The Salmon Of Doubt', "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "Life, the Universe and Everything (Hitchhiker's Guide, #3)", "The Deeper Meaning of Liff: A Dictionary of Things That There Aren't Any Words for Yet", "The Hitchhiker's Guide to the Galaxy", 'So long, and thanks for all the fish', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', 'Hasta luego, y gracias por el pescado (Guía del autoestopista galáctico, #4)', "Life, the Universe and Everything (Hitchhiker's Guide, #3)", "Dirk Gently's Holistic Detective Agency (Dirk Gently #1)", 'Fonds de tiroir (Dirk Gently, détective holistique, #3)', 'Salut, et encore merci pour le poisson', "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1-5)", "The Hitch Hiker's Guide To The Galaxy: Primary Phase", "The Hitchhiker's Guide to the Galaxy (The Original Radio Series)", 
"The Hitchhiker's Guide to the Galaxy: Primary & Secondary Phase", "The Hitchhiker's Guide to the Galaxy: The Quintessential Phase (Hitchhiker's Guide: Radio Play, #5)", "The Hitchhiker's Guide to the Galaxy: The Complete Trilogy of Five (Hitchhiker's Guide, #1-5)", "Dirk Gently's Holistic Detective Agency (Dirk Gently #1)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Illustrated Hitchhiker's Guide To The Galaxy", 'Das Restaurant am Ende des Universums (Per Anhalter durch die Galaxis, #2)', 'Per Anhalter durch die Galaxis (Per Anhalter durch die Galaxis, #1)', 'Lachs im Zweifel. Zum letzten Mal per Anhalter durch die Galaxis', 'Das Restaurant am Ende des Universums. Hörspiel (Per Anhalter durch die Galaxis, #2)', 'Per Anhalter Durch Die Galaxis 2 Hörspiel', 'Per Anhalter durch die Galaxis', 'Per Anhalter durch die Galaxis (Per Anhalter durch die Galaxis, #1)', 'Per Anhalter durch die Galaxis (Per Anhalter durch die Galaxis, #1)', 'Das Leben, das Universum und der ganze Rest (Per Anhalter durch die Galaxis, #3)', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide #1)", "Mostly Harmless (Hitchhiker's Guide, #5)", 'So Long, and Thanks for All the Fish', 'מדריך הטרמפיסט לגלקסיה', 'Le Guide du routard galactique', 'guide galactique', 'Guide du routard galactique', 'Autostopem przez galaktykę', 'Życie, Wszechświat i cała reszta', 'Restauracja na końcu wszechświata', 'Der tiefere Sinn des Labenz (Der Sinn des Labenz #2)', 'The Deeper Meaning of Liff', "Mostly Harmless (Hitchhiker's Guide, #5)", "Mostly Harmless (Hitchhiker's Guide, #5)", 'Der lange dunkle Fünfuhrtee der Seele', 'Der tiefere Sinn des Labenz (Der Sinn des Labenz #2)', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', "Mostly Harmless (Hitchhiker's Guide, #5)", "Life, the Universe and Everything (Hitchhiker's Guide, #3)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1-5)", 'Cześć, i 
dzięki za ryby (Autostopem przez Galaktykę #4)', 'W zasadzie niegroźna (Autostopem przez Galaktykę #5)', 'The Salmon of Doubt: Hitchhiking the Galaxy One Last Time', "So Long, and Thanks for All the Fish (Hitchhiker's Guide, #4)", "Mostly Harmless (Hitchhiker's Guide, #5)", "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide, #1)", 'Le Guide galactique (coffret en 5 volumes)', 'Mañana no estarán. En busca de las más variopintas especies de animales al borde de la extinción', "Dirk Gently's holistische Detektei (Dirk Gently #1)", "Life, the Universe and Everything (Hitchhiker's Guide, #3)", 'The Salmon of Doubt: Hitchhiking the Galaxy One Last Time', 'Last Chance to See', 'Last Chance to See', "Hitchhikers Trilogy: Omni Edition (Hitchhiker's Guide, #1-3)", 'Hotondo Mugai', 'O Guia do Mochileiro das Galáxias (O mochileiro das galáxias, #1)', "The Ultimate Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy #1-5)", "The More Than Complete Hitchhikers Guide (Hitchhiker's Guide, #1-4 + short story)", 'Per Anhalter durch die Galaxis', 'Per Anhalter durch die Galaxis/Das Restaurant am Ende des Universums (Per Anhalter durch die Galaxis, #1-2)', "Liftarens guide till galaxen (Hitchhiker's Guide, #1)", '宇宙クリケット大戦争 [Uchū kuriketto daisensō]', '銀河ヒッチハイク・ガイド [Ginga hitchihaiku gaido]', '宇宙の果てのレストラン [Uchū no hate no resutoran]', 'Het transgalactisch liftershandboek', 'Die Letzten ihrer Art', 'Die Letzten ihrer Art', 'Die letzten ihrer Art', 'Die Letzten ihrer Art', "La vita, l'universo e tutto quanto (Guida galattica per gli autostoppisti, #3)", "Ristorante al termine dell'Universo (Guida galattica per gli autostoppisti, #2)", 'Das Restaurant am Ende des Universums (Per Anhalter durch die Galaxis, #2)', 'Das Restaurant am Ende des Universums (Per Anhalter durch die Galaxis, #2)', 'Per Anhalter durch die Galaxis / Das Restaurant am Ende des Universums', 'Das Leben, das Universum und der ganze Rest (Per Anhalter durch die Galaxis, #3)', "Das Leben, das 
Universum & der ganze Rest/Mach's gut, & danke für den Fisch", 'Das Leben, das Universum und der ganze Rest (Per Anhalter durch die Galaxis, #3)', "Macht's gut und danke für den Fisch", "Macht's gut und danke für den Fisch (Per Anhalter durch die Galaxis, #4)", 'Der junge Zaphod geht auf Nummer Sicher', "Der elektrische Mönch: Dirk Gently's holistische Detektei (Dirk Gently #1)", 'Le Guide galactique', "Life, the Universe and Everything (Hitchhiker's Guide, #3)", "Hitch-hiker's Guide to the Galaxy: The Original Radio Scripts", 'O restaurante no fim do universo (O Guia do Mochileiro das Galáxias, #2)', 'A Vida, o Universo e Tudo Mais (O Guia do Mochileiro das Galáxias, #3)', 'Até mais, e obrigado pelos peixes! (O Guia do Mochileiro das Galáxias, #4)', 'Praticamente inofensiva (O Guia do Mochileiro das Galáxias, #5)', 'Globalement Inoffensive', "Dirk Gently's Holistic Detective Agency (Dirk Gently #1)", 'Le Dernier Restaurant Avant La Fin Du Monderoman', 'The Long Dark Tea-Time of the Soul (Dirk Gently, #2)', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", "The Hitchhiker's Guide to the Galaxy", "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", 'Per Anhalter durch die Galaxis', "Life, the Universe and Everything (Hitchhiker's Guide, #3)", 'Per Anhalter durch die Galaxis (Per Anhalter durch die Galaxis, #1)', 'Guida galattica per gli autostoppisti', "The Ultimate Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy #1-5)", "Mostly Harmless (Hitchhiker's Guide, #5)", 'The Salmon Of Doubt', 'Le guide du voyageur galactique', 'Addio e grazie per tutto il pesce (Guida galattica per gli autostoppisti, #4)', "The Restaurant at the End of the Universe (Hitchhiker's Guide, #2)", 'Der tiefere Sinn des Labenz', 'Der lange dunkle Fünfuhrtee der Seele', "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)", 'Hitchhikers Guide: 2 Secondary Spec.(4 Cd', 'Douglas Adams Live in Concert', 
'Štoparski vodnik po Galaksiji (Štoparski vodnik po Galaksiji, #1)', ..., "Hitchhiker's Guide to the Galaxy: Original Radio Scripts"]}
(output truncated: the full list contains several hundred title variants of Douglas Adams' books, covering translations and editions in many languages)
# Get the top 20 highest-rated books sorted by average_rating
# (their authors are treated here as the "top 20 authors")
top_20_authors = df_lighter_books.sort(col("average_rating").desc()).limit(20)
# Add a column with the length of each title
top_20_authors = top_20_authors.withColumn("title_length", length(col("title")))
# Find the title with the maximum length
longest_title_among_top_20 = top_20_authors.sort(col("title_length").desc()).select("title").first()[0]
print("Longest book title among the books of the top 20 authors is:", longest_title_among_top_20)
Longest book title among the books of the top 20 authors is: Domestic Politics and Family Absence: The Correspondence (1588-1621) of Robert Sidney, First Earl of Leicester, and Barbara Gamage Sidney, Countess of Leicester
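The top-N-then-longest-title pattern used above can be sketched in plain pandas on a toy DataFrame (the column names mirror the dataset, but the rows here are made up purely for illustration):

```python
import pandas as pd

# Toy stand-in for lighter_books (illustrative data only)
books = pd.DataFrame({
    "title": ["A", "A Very Long Book Title Indeed", "Mid-length Title", "Short"],
    "average_rating": [4.9, 4.7, 4.8, 3.2],
})

# Keep the top 3 books by average rating, then find the longest title among them
top3 = books.sort_values("average_rating", ascending=False).head(3)
longest = top3.loc[top3["title"].str.len().idxmax(), "title"]
print(longest)  # A Very Long Book Title Indeed
```

The PySpark version above does the same thing at scale: `withColumn` plus `length` plays the role of `str.len()`, and sorting descending on the derived column plays the role of `idxmax`.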
# Add a column with the length of each title
longest_title = df_lighter_books.withColumn("title_length", length(col("title")))
# Find the title with the maximum length
longest_title_name = longest_title.sort(col("title_length").desc()).select("title").first()[0]
print("Longest book title overall is:", longest_title_name)
Longest book title overall is: A General Introduction to Domesday Book: Accompanied by Indexes of the Tenants-In-Chief, and Under-Tenants, at the Time of the Survey: As Well as of the Holders of Lands Mentioned in Domesday Anterior to the Formation of That Record: With an Abstract o...
Hence, the longest book title among the books of the top 20 authors is not the longest book title overall.
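Strictly speaking, the cell above ranks individual books (editions) by average_rating rather than authors. If the top-20 *authors* were meant, one could first rank authors by their mean rating and then search their books for the longest title. A toy pandas sketch under that assumption (the frame and values here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the books data; in the notebook this would be done on
# df_lighter_books (a Spark DataFrame) with groupBy/agg instead.
books = pd.DataFrame({
    "author_name": ["A", "A", "B", "C"],
    "average_rating": [4.9, 4.7, 4.75, 3.0],
    "title": ["Short", "A Considerably Longer Title", "Mid", "C's Book"],
})
# Rank authors by mean rating, keep the top 2 (top 20 in the real data)
top_authors = (books.groupby("author_name")["average_rating"]
               .mean().nlargest(2).index)
# Longest title among those authors' books
their_books = books[books["author_name"].isin(top_authors)]
longest = their_books.loc[their_books["title"].str.len().idxmax(), "title"]
print(longest)
```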
# Find the title with the minimum length
shortest_title_name = longest_title.sort(col("title_length").asc()).select("title").first()[0]
print("Shortest book title overall is:", shortest_title_name)
Shortest book title overall is: a
The shortest title being simply "a" might seem unusual, but it is not impossible: books with very short, even one-letter, titles do exist. If we suspected a data-entry or processing error, it would be worth investigating further, but for now we assume the data is correct.
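A quick sanity check one could run here is to count how many titles consist of a single character; a handful is plausible, thousands would suggest a data problem. A minimal pandas sketch on a toy frame (in the notebook this would be done on `df_lighter_books` with PySpark's `length()` and `filter()`):

```python
import pandas as pd

# Toy stand-in for the books frame
books = pd.DataFrame({"title": ["a", "V", "It", "The Stand"]})
# Count titles that are exactly one character long
one_char = books[books["title"].str.len() == 1]
print(len(one_char))
```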
Plot the top 10 most influential authors regarding their fan count and number of books. Who is the most influential author?
#open the first file and convert it into a dataframe
df_lighter_authors = pd.read_json("lighter_authors.json", lines = True)
#MY IDEA --> to plot the 10 most influential authors, I first need to retrieve their fan counts and the number of books they published. Since the dataset is very large,
#I "filter" it by taking only the columns I need: the name of the author, the number of fans and the list of books.
#To find the most influential author with respect to fan and book numbers, I need to order the authors.
#To do so, I sort them by number of fans and break ties by number of books.
#note: the total number of books is obtained from the length of the list of all published books.
#create the smaller dataframe
df = pd.DataFrame({
'Author Name': df_lighter_authors['name'],
'Fan Count': df_lighter_authors['fans_count'],
'List of Books': df_lighter_authors['book_ids']
})
#count the number of books (the length of the book list) as an integer value
df['Number of Books'] = df['List of Books'].apply(len)
#now I sort them by the number of fans and break the ties by the number of books
df = df.sort_values(by=["Fan Count", "Number of Books"], ascending=[False, False]).head(10)
# Create the plot
plt.barh(df["Author Name"], df["Fan Count"])
plt.xlabel("Fan Count")
plt.ylabel("Author Name")
plt.title("Top 10 Authors with the Most Fans")
plt.show()
The most influential author is Stephen King.
note: this result counts all editions in all languages. This is intentional: to gauge influence, every edition and translation of a book matters, since they show how widely the author is known across the globe.
Have they published any series of books? If any, extract the longest series name among these authors.
note: this question is ambiguous. Does it ask for the longest name of a series, or for the name of the longest series (the one with the most books)? I'll do both.
#MY IDEA --> now I need to inspect the lighter_books.json file, but since the file is very very very (VERY) large, I cannot open it like the other one, so I read it in chunks and then merge them.
#Furthermore, to work more comfortably, every time I need this file I will retain only the columns I need
#(so the following code may appear again during the tasks with some variations according to my needs).
#in this case I need to check if a writer has written a book series, so I will only need the column with their names and the columns related to the series.
#file
json_file_path = 'lighter_books.json'
#the columns I want
columns_to_read = ["author_name", "series_id","series_name","series_position"]
#chunk size
chunk_size = 20000
#empty dataframe to store result
result_df = pd.DataFrame(columns=columns_to_read)
#read the file
json_reader = pd.read_json(json_file_path, lines=True, chunksize=chunk_size)
#for every chunk append the selected columns to the dataframe I created before to store the result
for chunk in json_reader:
chunk = chunk[columns_to_read]
result_df = pd.concat([result_df, chunk], ignore_index=True)
print(result_df.head())
    author_name  series_id   series_name series_position
0  J.K. Rowling      45175  Harry Potter               5
1  J.K. Rowling      45175  Harry Potter               1
2  J.K. Rowling      45175  Harry Potter               2
3  J.K. Rowling      45175  Harry Potter               3
4  J.K. Rowling      45175  Harry Potter               4
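One performance note: calling `pd.concat` inside the loop copies the growing frame on every iteration, which is quadratic on a 7M-row file. Collecting the chunks in a list and concatenating once at the end is much faster. A minimal sketch, using an in-memory JSON-lines sample standing in for lighter_books.json:

```python
import io
import pandas as pd

# Hypothetical in-memory sample; the real code would pass the file path
sample = "\n".join([
    '{"author_name": "A", "series_id": 1, "series_name": "S", "series_position": "1"}',
    '{"author_name": "B", "series_id": 2, "series_name": "T", "series_position": "1"}',
    '{"author_name": "C", "series_id": 3, "series_name": "U", "series_position": "2"}',
])
columns_to_read = ["author_name", "series_id"]
chunks = []
for chunk in pd.read_json(io.StringIO(sample), lines=True, chunksize=2):
    chunks.append(chunk[columns_to_read])
result_df = pd.concat(chunks, ignore_index=True)  # a single concat at the end
print(result_df.shape)
```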
#MY IDEA --> to check if an author has written a series, I write their name and check whether the series_id column contains at least one non-NA entry.
#If it does, the author has written a series of books. If everything is NA, they have not written a series of books.
#since the authors are only ten (1 of whom is not primarily an author) I will check them one by one. This is slow and could be done for all of them at once, but I have little time, so
#I will proceed this way for now and, if I have time, I will vectorise it (in which case you will not read this :) )
author_name_to_check = "Bill Gates"
#filter the dataframe with their name
author_books = result_df[result_df['author_name'] == author_name_to_check]
#check for NA
has_written_series = not author_books['series_id'].isna().all()
#output for me
if has_written_series:
print(f"{author_name_to_check} has written a series of books.")
else:
print(f"{author_name_to_check} has not written a series of books.")
Bill Gates has written a series of books.
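The one-by-one check mentioned above can indeed be done for all ten authors in a single pass: group by author and ask whether any series_id is non-NA. A sketch on a toy frame standing in for result_df:

```python
import pandas as pd

# Toy stand-in for result_df: author "B" has no series entries
result_df = pd.DataFrame({
    "author_name": ["A", "A", "B", "C"],
    "series_id": [45175.0, None, None, 99.0],
})
# True per author if at least one non-NA series_id exists
has_series = (result_df.groupby("author_name")["series_id"]
              .apply(lambda s: s.notna().any()))
print(has_series)
```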
#now I take all the authors and subset the dataframe using their names.
specific_authors = ["Stephen King", "Veronica Roth","Bill Gates","Colleen Hoover", "Rick Riordan", "James Patterson", "John Green", "Neil deGrasse Tyson", "Cassandra Clare","Mindy Kaling"]
specific_authors_df = result_df[result_df['author_name'].isin(specific_authors)]
#MY IDEA --> THIS IS FOR FINDING THE LONGEST SERIES BY NUMBER OF BOOKS
#I take, for each author, the maximum of the series_position column. This works because the column stores the position of
#a book within its series, e.g. Harry Potter and the Order of the Phoenix has position 5. So the greatest position identifies the longest series.
specific_authors_df = specific_authors_df.drop_duplicates(subset=['series_name', 'series_position'])
max_series_positions = specific_authors_df.groupby('author_name')['series_position'].max()
print(max_series_positions)
author_name
Cassandra Clare               Sampler
Colleen Hoover                      3
James Patterson                     9
John Green                          2
Neil deGrasse Tyson                 2
Rick Riordan           companion book
Stephen King                       7C
Veronica Roth                       3
Name: series_position, dtype: object
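The odd values above ("Sampler", "7C", "companion book") arise because series_position is a string column, so `.max()` compares lexicographically ("Sampler" > "9"). Coercing to numeric first, so that non-numeric entries become NaN and are ignored, gives the true maximum position. A sketch on toy data:

```python
import pandas as pd

# Toy frame with the kinds of values seen in the real output
df = pd.DataFrame({
    "author_name": ["X", "X", "X", "Y"],
    "series_position": ["2", "12", "7C", "Sampler"],
})
# Non-numeric positions become NaN and drop out of the max
df["pos_num"] = pd.to_numeric(df["series_position"], errors="coerce")
max_pos = df.groupby("author_name")["pos_num"].max()
print(max_pos)
```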
Stephen King has written a series of books. Veronica Roth has written a series of books. Colleen Hoover has written a series of books. Rick Riordan has written a series of books. James Patterson has written a series of books. John Green has written a series of books. Neil deGrasse Tyson has written a series of books. Cassandra Clare has written a series of books. Mindy Kaling has not written a series of books. Bill Gates has not written a series of books.
The author with the longest book series is James Patterson; the series is Alex Cross.
#MY IDEA --> THIS IS FOR FINDING THE SERIES WITH THE LONGEST NAME
longest_series_name = specific_authors_df['series_name'].str.len().idxmax()
longest_series_title = specific_authors_df.loc[longest_series_name, 'series_name']
print(longest_series_title)
Percy Jackson and the Olympians: The Graphic Novels
The series with the longest name is: Percy Jackson and the Olympians: The Graphic Novels
How many of these authors have been published in different formats? Provide a meaningful chart on the distribution of the formats and comment on it.
#MY IDEA --> as I warned before, here is the same code with slight differences.
#this time I do exactly what I did before but select different columns: the author, the book title, and the format (what I am after)
json_file_path = 'lighter_books.json'
columns_to_read = ["author_name","format","title"]
chunk_size = 20000
result_df = pd.DataFrame(columns=columns_to_read)
json_reader = pd.read_json(json_file_path, lines=True, chunksize=chunk_size)
for chunk in json_reader:
chunk = chunk[columns_to_read]
result_df = pd.concat([result_df, chunk], ignore_index=True)
print(result_df.head())
author_name format \
0 J.K. Rowling Paperback
1 J.K. Rowling Hardcover
2 J.K. Rowling Hardcover
3 J.K. Rowling Mass Market Paperback
4 J.K. Rowling Paperback
title
0 Harry Potter and the Order of the Phoenix (Har...
1 Harry Potter and the Sorcerer's Stone (Harry P...
2 Harry Potter and the Chamber of Secrets (Harry...
3 Harry Potter and the Prisoner of Azkaban (Harr...
4 Harry Potter and the Goblet of Fire (Harry Pot...
#MY IDEA --> now I check the format types: for each author I collect the unique formats in which their books have ever been published.
#I check the different format types for each author
author_names = ["Stephen King", "Veronica Roth", "Bill Gates","Colleen Hoover", "Rick Riordan", "James Patterson", "John Green", "Neil deGrasse Tyson", "Cassandra Clare","Mindy Kaling"]
all_formats_df = pd.DataFrame(columns=['Author Name', 'Formats Published'])
#check for each author
for author_name in author_names:
#filter the dataframe for the author
author_df = result_df[result_df['author_name'] == author_name]
#group by author
grouped = author_df.groupby('author_name')
#group by format
formats_published = grouped['format'].unique().reset_index()
formats_published.columns = ['Author Name', 'Formats Published']
all_formats_df = pd.concat([all_formats_df, formats_published])
all_formats_df = all_formats_df.reset_index(drop=True)
#result
print(all_formats_df)
           Author Name                                  Formats Published
0         Stephen King  [Mass Market Paperback, Paperback, , Hardcover...
1        Veronica Roth  [Hardcover, , Trade Paperback, Paperback, eboo...
2           Bill Gates  [, Paperback, Audio CD, Hardcover, Mass Market...
3       Colleen Hoover  [ebook, Paperback, Kindle Edition, , Audiobook...
4         Rick Riordan  [Hardcover, Paperback, Audio CD, Library Bindi...
5      James Patterson  [Paperback, Mass Market Paperback, , Hardcover...
6           John Green  [Hardcover, Paperback, Audio CD, Flexcover, eb...
7  Neil deGrasse Tyson  [Paperback, Hardcover, Audio CD, Audiobook, Au...
8      Cassandra Clare  [Hardcover, Audio CD, Paperback, Audiobook, Ki...
9         Mindy Kaling  [Paperback, Hardcover, Audio CD, , ebook, Audi...
#MY IDEA --> with that result, I now investigate how many times each format type has been used (still restricted to these authors)
formats_series = all_formats_df['Formats Published'].explode()
#count the occurrences of each format
format_counts = formats_series.value_counts()
#plot
plt.figure(figsize=(14, 6))
format_counts.plot(kind='bar')
plt.title('Distribution of Formats')
plt.xlabel('Format')
plt.ylabel('Count')
plt.show()
c:\Users\stefa\anaconda3\envs\example\lib\site-packages\IPython\core\pylabtools.py:152: UserWarning: Glyph 21109 (\N{CJK UNIFIED IDEOGRAPH-5275}) missing from current font.
fig.canvas.print_figure(bytes_io, **kw)
(the same warning repeats for five more CJK glyphs missing from the current font)
As we can see from the plot, Audiobook, Kindle Edition, Hardcover, Audio CD, Paperback, ebook, and Audible Audio are the formats most used by these authors. Yet we can also see mis-entries in the database: Audiobook, Audible Audio, Audio CD, MP3, MP3 Book, audio, and Audio Book all appear as distinct categories even though they all refer to the same kind of format. So the format field was poorly standardised when the dataset was created.
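One way to correct for this before counting is to collapse the audio spellings into a single category. The variant list below is a hypothetical normalisation map based on the categories spotted in the plot; it would need extending as more spellings turn up:

```python
import pandas as pd

# Assumed variant list (not exhaustive) collapsing audio formats into one label
audio_variants = {"Audiobook", "Audible Audio", "Audio CD", "MP3",
                  "MP3 Book", "Audio Book", "audio"}
formats = pd.Series(["Audiobook", "MP3", "Paperback", "Audio CD", "ebook"])
# Replace any audio variant with the single label "Audio", keep the rest
normalised = formats.where(~formats.isin(audio_variants), "Audio")
print(normalised.value_counts())
```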
Provide information about the general response from readers (number of fans, average rating, number of reviews, etc.), divide the authors by gender, and comment about anything eventually related to “structural bias.”
#MY IDEA --> now I need the lighter_authors.json file again, converted to a pandas dataframe.
#then I take the columns I need and create a dataframe with only the authors of interest. Then I can describe the general response from readers.
df_lighter_authors
df = pd.DataFrame({
'Author Name': df_lighter_authors['name'],
'Fan Count': df_lighter_authors['fans_count'],
'Number of Books': df_lighter_authors['book_ids'],
'Average rating': df_lighter_authors['average_rating'],
'Works Id': df_lighter_authors["work_ids"],
"Gender": df_lighter_authors["gender"],
"reviews": df_lighter_authors["ratings_count"]
})
specific_authors = ["Stephen King", "Veronica Roth", "Bill Gates","Colleen Hoover", "Rick Riordan", "James Patterson", "John Green", "Neil deGrasse Tyson", "Cassandra Clare","Mindy Kaling"]
specific_authors_df = df[df["Author Name"].isin(specific_authors)]
print(specific_authors_df)
Author Name Fan Count \
1017 Stephen King 766035
1161 James Patterson 317040
3818 Neil deGrasse Tyson 277850
4537 Rick Riordan 339346
6397 Bill Gates 446876
27110 Cassandra Clare 238560
32358 Mindy Kaling 234647
100166 John Green 282537
169374 Veronica Roth 455358
205870 Colleen Hoover 356633
Number of Books Average rating \
1017 [11588, 830502, 149267, 10614, 10592, 43615, 1... 4.04
1161 [13145, 13148, 13137, 13152, 7494, 13139, 1313... 3.98
3818 [32191710, 20663770, 20663771, 20707975, 20663... 4.07
4537 [28187, 2120932, 28186, 4556058, 561456, 77361... 4.32
6397 [36603020, 39731947, 41619, 1081637, 41814307,... 3.85
27110 [256683, 3777732, 1582996, 7171637, 6752378, 1... 4.24
32358 [10335308, 30268522, 52435702, 38887475, 35134... 3.88
100166 [11870085, 99561, 6442769, 49750, 35504431, 65... 4.03
169374 [13335037, 11735983, 18710190, 18126198, 13615... 4.05
205870 [15717943, 27362503, 30333938, 17788401, 24445... 4.24
Works Id Gender reviews
1017 [849585, 150259, 1742269, 3230869, 1552134, 46... male 15195579
1161 [3373984, 881425, 1711194, 1829521, 141228, 88... male 6514711
3818 [52832605, 1204936, 23561, 16946049, 1323039, ... 175216
4537 [3346751, 2377629, 43554, 4551489, 2936530, 10... male 9144089
6397 [321174, 51432387, 314, 1479853, 60054895, 688... 9418
27110 [2267189, 3443248, 1575860, 6674837, 6948844, ... female 7401435
32358 [15237902, 41897766, 25273867, 60430209, 56458... 611045
100166 [16827462, 919292, 3364505, 48658, 21576687, 6... 7186569
169374 [13155899, 15524542, 15524549, 25457412, 19216... female 5739316
205870 [21389085, 43940889, 18602144, 24878172, 24878... female 2239503
#Fan Counts
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.barh(specific_authors_df['Author Name'], specific_authors_df['Fan Count'])
plt.xlabel('Counts')
plt.title('Fan Counts')
#average ratings
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.barh(specific_authors_df['Author Name'], specific_authors_df['Average rating'])
plt.xlabel('Counts')
plt.title('Average Rating')
#number of reviews
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.barh(specific_authors_df['Author Name'], specific_authors_df['reviews'])
plt.xlabel('Counts')
plt.title('Reviews')
#Gender division
gender_table = pd.crosstab(index=specific_authors_df['Gender'], columns='Count')
# Rename the 'Count' column to 'Gender Count'
gender_table = gender_table.rename(columns={'Count': 'Gender Count'})
# Display the gender frequency table
print(gender_table)
col_0 Gender Count
Gender
4
female 3
male 3
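The unlabeled row in the crosstab above comes from authors whose gender field is an empty string. Recoding those to an explicit "unknown" label makes the table self-explanatory; a sketch on toy data mirroring the ten authors:

```python
import pandas as pd

# Toy gender column: 3 male, 3 female, 4 empty, as in the real output
gender = pd.Series(["male", "", "female", "", "male", "",
                    "female", "", "female", "male"])
# Recode empty strings to an explicit label before tabulating
gender_table = pd.crosstab(index=gender.replace("", "unknown"),
                           columns="Count")
print(gender_table)
```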
#MY IDEA --> now I want to investigate the statistics
gender_groups = specific_authors_df.groupby('Gender')
avg_rating_stats = gender_groups['Average rating'].describe()
print(avg_rating_stats)
avg_rating_means = gender_groups['Average rating'].mean()
print(avg_rating_means)
avg_rating_stats = gender_groups['reviews'].describe()
print(avg_rating_stats)
avg_rating_means = gender_groups['reviews'].mean()
print(avg_rating_means)
avg_rating_stats = gender_groups['Fan Count'].describe()
print(avg_rating_stats)
avg_rating_means = gender_groups['Fan Count'].mean()
print(avg_rating_means)
count mean std min 25% 50% 75% max
Gender
4.0 3.957500 0.108743 3.85 3.8725 3.955 4.04 4.07
female 3.0 4.176667 0.109697 4.05 4.1450 4.240 4.24 4.24
male 3.0 4.113333 0.181475 3.98 4.0100 4.040 4.18 4.32
Gender
3.957500
female 4.176667
male 4.113333
Name: Average rating, dtype: float64
count mean std min 25% 50% \
Gender
4.0 1.995562e+06 3.469960e+06 9418.0 133766.5 393130.5
female 3.0 5.126751e+06 2.634922e+06 2239503.0 3989409.5 5739316.0
male 3.0 1.028479e+07 4.451435e+06 6514711.0 7829400.0 9144089.0
75% max
Gender
2254926.0 7186569.0
female 6570375.5 7401435.0
male 12169834.0 15195579.0
Gender
1.995562e+06
female 5.126751e+06
male 1.028479e+07
Name: reviews, dtype: float64
count mean std min 25% 50% \
Gender
4.0 310477.500000 93452.377942 234647.0 267049.25 280193.5
female 3.0 350183.666667 108542.796289 238560.0 297596.50 356633.0
male 3.0 474140.333333 253034.111792 317040.0 328193.00 339346.0
75% max
Gender
323621.75 446876.0
female 405995.50 455358.0
male 552690.50 766035.0
Gender
310477.500000
female 350183.666667
male 474140.333333
Name: Fan Count, dtype: float64
When exploring the data by gender, we can see that the average rating is balanced between male and female authors. Yet, when analysing reviews and fan counts, there is a massive imbalance between genders: male authors have more reviews and fans even though they have roughly the same average rating as their female colleagues. Note also that four of the ten authors are missing a gender classification.
For this question, consider again the top 10 authors by number of fans.
#MY IDEA --> take the top 10 authors with the most fans for both files.
df = pd.DataFrame({
'Author Name': df_lighter_authors['name'],
'Fan Count': df_lighter_authors['fans_count'],
'Number of Books': df_lighter_authors['book_ids']
})
# Sort the DataFrame by 'Fan Count' in descending order
sorted_df = df.sort_values(by='Fan Count', ascending=False)
# Select the top 10 authors
top_10_authors = sorted_df.head(10)
# Print the top 10 influential authors
print(top_10_authors[['Author Name', 'Fan Count', 'Number of Books']])
# Plot the data
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.barh(top_10_authors['Author Name'], top_10_authors['Fan Count'])
plt.xlabel('Fan Counts')
plt.title('Top 10 Authors with fans')
Author Name Fan Count \
1017 Stephen King 766035
169374 Veronica Roth 455358
6397 Bill Gates 446876
205870 Colleen Hoover 356633
4537 Rick Riordan 339346
1161 James Patterson 317040
100166 John Green 282537
3818 Neil deGrasse Tyson 277850
27110 Cassandra Clare 238560
32358 Mindy Kaling 234647
Number of Books
1017 [11588, 830502, 149267, 10614, 10592, 43615, 1...
169374 [13335037, 11735983, 18710190, 18126198, 13615...
6397 [36603020, 39731947, 41619, 1081637, 41814307,...
205870 [15717943, 27362503, 30333938, 17788401, 24445...
4537 [28187, 2120932, 28186, 4556058, 561456, 77361...
1161 [13145, 13148, 13137, 13152, 7494, 13139, 1313...
100166 [11870085, 99561, 6442769, 49750, 35504431, 65...
3818 [32191710, 20663770, 20663771, 20707975, 20663...
27110 [256683, 3777732, 1582996, 7171637, 6752378, 1...
32358 [10335308, 30268522, 52435702, 38887475, 35134...
Text(0.5, 1.0, 'Top 10 Authors with fans')
Provide the average time gap between two subsequent publications for a series of books and those not belonging to a series. What do you expect to see, and what is the actual answer to this question?
#MY IDEA --> import the dataset with only the columns I need. some of them will be needed later for other tasks
json_file_path = 'lighter_books.json'
columns_to_read = ["author_name","publication_date","original_publication_date","series_name","series_position","title","average_rating","num_pages","language","format"]
chunk_size = 20000
result_df = pd.DataFrame(columns=columns_to_read)
json_reader = pd.read_json(json_file_path, lines=True, chunksize=chunk_size)
for chunk in json_reader:
chunk = chunk[columns_to_read]
result_df = pd.concat([result_df, chunk ], ignore_index=True)
print(result_df.head())
    author_name publication_date original_publication_date   series_name  \
0  J.K. Rowling          2004-09                2003-06-21  Harry Potter
1  J.K. Rowling       2003-11-01                1997-06-26  Harry Potter
2  J.K. Rowling       2003-11-01                1998-07-02  Harry Potter
3  J.K. Rowling       2004-05-01                1999-07-08  Harry Potter
4  J.K. Rowling       2002-09-28                2000-07-08  Harry Potter

  series_position                                              title  \
0               5  Harry Potter and the Order of the Phoenix (Har...
1               1  Harry Potter and the Sorcerer's Stone (Harry P...
2               2  Harry Potter and the Chamber of Secrets (Harry...
3               3  Harry Potter and the Prisoner of Azkaban (Harr...
4               4  Harry Potter and the Goblet of Fire (Harry Pot...

   average_rating num_pages language                 format
0            4.50       870      eng              Paperback
1            4.48       309      eng              Hardcover
2            4.43       352      eng              Hardcover
3            4.57       435      eng  Mass Market Paperback
4            4.56       734      eng              Paperback
#MY IDEA --> take only the books written by these authors.
df = pd.DataFrame({
'Original Publication': result_df["original_publication_date"],
'Author Name': result_df["author_name"],
'Series Name': result_df["series_name"],
"Serie Position": result_df["series_position"],
"Title": result_df["title"],
"Average Rating":result_df["average_rating"],
"Pages":result_df["num_pages"],
"Language":result_df["language"],
"Format": result_df["format"]
})
specific_authors = ["Stephen King", "Veronica Roth","Bill Gates" ,"Colleen Hoover", "Rick Riordan", "James Patterson", "John Green", "Neil deGrasse Tyson", "Cassandra Clare","Mindy Kaling"]
books_df = df[df["Author Name"].isin(specific_authors)]
print(books_df)
Original Publication Author Name \
3114 2003-11-04 Stephen King
3178 2004-09-21 Stephen King
3179 2002 Stephen King
3180 2004-06-08 Stephen King
3181 1987-05 Stephen King
... ... ...
7025403 2013-09-16 James Patterson
7025474 1977-01-28 Stephen King
7025621 2012-10-02 Rick Riordan
7026344 2015-03-17 Cassandra Clare
7026879 2011-06-09 James Patterson
Series Name Serie Position \
3114 The Dark Tower 5
3178 The Dark Tower 7
3179 The Dark Tower 1-4
3180 The Dark Tower 6
3181 The Dark Tower 2
... ... ...
7025403 Treasure Hunters 1
7025474 The Shining 1
7025621 The Heroes of Olympus 3
7026344 Tales from the Shadowhunter Academy 2
7026879 Private 4
Title Average Rating \
3114 Wolves of the Calla (The Dark Tower, #5) 4.16
3178 The Dark Tower (The Dark Tower, #7) 4.24
3179 Dark Tower Boxed Set 4.50
3180 Song of Susannah (The Dark Tower, #6) 3.97
3181 The Drawing of the Three (The Dark Tower, #2) 4.21
... ... ...
7025403 Lovci pokladů (Treasure Hunters, #1) 4.17
7025474 Shining 4.23
7025621 Athenas tecken (Olympens hjältar, #3) 4.45
7026344 The Lost Herondale (Tales from the Shadowhunte... 4.13
7026879 Falsche Schuld (Private #4) 3.79
Pages Language Format
3114 931 eng Mass Market Paperback
3178 1050 eng Paperback
3179 Paperback
3180 544 eng Mass Market Paperback
3181 463 eng Paperback
... ... ... ...
7025403 468 cze Hardcover
7025474 494 ger Paperback
7025621 557 swe Paperback
7026344 2 eng Audio CD
7026879 ger
[16471 rows x 9 columns]
#MY IDEA --> now I filter the data so that books with zero or a negative number of pages are eliminated.
#I also remove all the different-language versions of a book
books_df = books_df.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
books_df['Pages'] = pd.to_numeric(books_df['Pages'], errors='coerce')
#remove 0 and negative
books_df = books_df[books_df['Pages'] > 0]
#take only english books, so that I remove all other version of the book
books_df = books_df[books_df['Language'] == "eng"]
#MY IDEA --> keep only the books that belong to a series and drop all the others.
#then I need to clean the dataset by keeping only ONE book per series per position
#since I retained the original publication date, it does not matter which edition I keep; the only thing I care about is that the date is a full date and not only a year.
#at the very end, I remove all series that consist of just one book (a single book is not a series)
#remove all NA and empty values --> so I keep only series books.
filtered_df = books_df[books_df['Series Name'].notna() & (books_df['Series Name'] != '')]
#drop all rows with "-" in the column series name (--> I'm doing so beacuse I want to remove some variables that have as value 1-4 or 1-6...)
filtered_df = filtered_df[filtered_df['Serie Position'].str.contains('-') == False]
#keep only dates containing two "-" separators, i.e. full year-month-day dates (this drops year-only and year-month dates).
filtered_df = filtered_df[filtered_df['Original Publication'].str.count('-') == 2]
#now, since a lot (A LOT!) of my rows are different editions of the same book, I keep just one per book per series, as I will use the
#original publication date, which is the same across all editions
filtered_df = filtered_df.drop_duplicates(subset=['Series Name', 'Serie Position'])
#remove all rows whose series name appears only once
counts = filtered_df['Series Name'].value_counts()
unique_series_names = counts[counts == 1].index
filtered_df = filtered_df[filtered_df['Series Name'].isin(unique_series_names) == False]
#now I have my clean dataframe :)
filtered_df
| Original Publication | Author Name | Series Name | Serie Position | Title | Average Rating | Pages | Language | Format | |
|---|---|---|---|---|---|---|---|---|---|
| 3114 | 2003-11-04 | Stephen King | The Dark Tower | 5 | Wolves of the Calla (The Dark Tower, #5) | 4.16 | 931.0 | eng | Mass Market Paperback |
| 3178 | 2004-09-21 | Stephen King | The Dark Tower | 7 | The Dark Tower (The Dark Tower, #7) | 4.24 | 1050.0 | eng | Paperback |
| 3180 | 2004-06-08 | Stephen King | The Dark Tower | 6 | Song of Susannah (The Dark Tower, #6) | 3.97 | 544.0 | eng | Mass Market Paperback |
| 3183 | 1997-11-04 | Stephen King | The Dark Tower | 4 | Wizard and Glass (The Dark Tower, #4) | 4.23 | 845.0 | eng | Paperback |
| 3185 | 1982-06-01 | Stephen King | The Dark Tower | 1 | The Gunslinger (The Dark Tower, #1) | 3.94 | 238.0 | eng | Paperback |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6800793 | 2015-11-23 | James Patterson | Alex Cross | 23 | Cross Justice (Alex Cross, # 23) | 4.23 | 420.0 | eng | Hardcover |
| 6800795 | 2015-01-15 | James Patterson | I Funny | 3 | I Totally Funniest: A Middle School Story: (I ... | 4.38 | 336.0 | eng | Paperback |
| 6953146 | 2016-03-14 | James Patterson | Private | 10 | Private Paris (Private, #10) | 3.86 | 141.0 | eng | Hardcover |
| 6964407 | 2015-04-24 | James Patterson | Middle School | 7.25 | Middle School: Rafe's Aussie Adventure (Middle... | 4.30 | 320.0 | eng | Paperback |
| 7002014 | 2016-01-19 | Colleen Hoover | Never Never | 3 | Never Never: Part Three (Never Never, #3) | 3.49 | 130.0 | eng | Kindle Edition |
212 rows × 9 columns
#MY IDEA --> now I have to print the average time between a publication. to do so, I need to make the original publication column into a datetime format.
#then I sort it by series name and position so that I can calculate the difference between two subsequent publications for each series.
#then I take the average for each series
#eliminate all books whose series position is fractional (0.5, 1.4, etc.)
filtered_df = filtered_df[filtered_df['Serie Position'].str.isnumeric()].copy()
#convert "Original Publication" column in a datetime format.
filtered_df['Original Publication'] = pd.to_datetime(filtered_df['Original Publication'])
#sort
filtered_df = filtered_df.sort_values(by=['Series Name', 'Serie Position'])
#for each series, calculate the time difference between two subsequent publications
filtered_df['Time Gap'] = filtered_df.groupby('Series Name')['Original Publication'].diff()
#average
average_time_gap = filtered_df.groupby('Series Name')['Time Gap'].mean()
print(average_time_gap)
Series Name
Alex Cross                                   177 days 13:05:27.272727272
Confessions                                            464 days 16:00:00
Daniel X                                               556 days 04:48:00
Daniel X: The Manga                                    289 days 00:00:00
Divergent                                              455 days 12:00:00
Honeymoon                                             3052 days 00:00:00
Hopeless                                               203 days 00:00:00
I Funny                                                555 days 00:00:00
Maximum Ride                                           462 days 09:00:00
Maximum Ride: The Manga                                424 days 12:00:00
Maybe                                                                NaT
Michael Bennett                              442 days 20:34:17.142857144
Middle School                                          544 days 08:00:00
NYPD Red                                               444 days 12:00:00
Never Never                                            188 days 12:00:00
Percy Jackson & Kane Chronicles Crossover              331 days 12:00:00
Percy Jackson and the Olympians                        351 days 18:00:00
Private                                                170 days 12:00:00
Slammed                                                241 days 00:00:00
Tales from the Shadowhunter Academy                     30 days 15:00:00
The Bane Chronicles                                     35 days 00:00:00
The Dark Artifices                                    1001 days 00:00:00
The Dark Tower                                        2037 days 00:00:00
The Dark Tower: Graphic Novels                         541 days 00:00:00
The Green Mile                                          30 days 19:12:00
The Heroes of Olympus                                  364 days 00:00:00
The Infernal Devices                                   465 days 12:00:00
The Infernal Devices: Manga                            287 days 00:00:00
The Kane Chronicles                                    364 days 00:00:00
The Last Hours                                         364 days 00:00:00
The Mortal Instruments                                 511 days 09:36:00
The Talisman                                          6155 days 00:00:00
Travis McKinley                                       6719 days 00:00:00
Treasure Hunters                                       350 days 00:00:00
Tres Navarre                                          1166 days 16:00:00
When the Wind Blows                                   1685 days 00:00:00
Witch & Wizard                                         771 days 12:00:00
Witch & Wizard: The Manga                              276 days 12:00:00
Women's Murder Club                          257 days 16:36:55.384615384
Zombicorns                                             579 days 00:00:00
Name: Time Gap, dtype: timedelta64[ns]
print(average_time_gap.mean())
855 days 00:52:07.174364112
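One caveat with the computation above: 'Serie Position' is still a string, so `sort_values` places "10" before "2", and for series with ten or more books (e.g. Alex Cross) the diffs are computed between the wrong neighbours. Converting the position to numeric before sorting fixes the ordering; a sketch on toy data:

```python
import pandas as pd

# Toy series with positions deliberately out of lexicographic order
toy = pd.DataFrame({
    "Series Name": ["S", "S", "S"],
    "Serie Position": ["1", "10", "2"],
    "Original Publication": pd.to_datetime(
        ["2000-01-01", "2002-01-01", "2001-01-01"]),
})
# Sort numerically, not lexicographically, before taking diffs
toy["pos"] = pd.to_numeric(toy["Serie Position"])
toy = toy.sort_values(["Series Name", "pos"])
mean_gap = toy.groupby("Series Name")["Original Publication"].diff().mean()
print(mean_gap)
```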
I was not sure whether the question asks for the average gap between books of the same series, or for the overall average gap across all series books. So I provided both. The overall average is 855 days.
If the per-series average was wanted, then each series has its own value. Most series publish roughly one book per year, which is reasonable. A few are above one year, some with gaps of up to about 8 years. This may look strange, but in my opinion it is normal; just take as an example how long George R.R. Martin takes to publish a "Game of Thrones" book.
Some series have a very short average gap (a couple of months or less), but inspecting them shows they are all graphic novels, mangas, or short stories. What I did not expect are the few series with gaps between 100 and 355 days: personally I find it strange to write and publish a full book in a few months or less than a year.
### MY IDEA --> now I do the inverse: I keep only books that do not belong to a series, I clean the dataset by removing outliers in the date format, and I drop duplicates
#caused by re-publications of the same book; then I proceed to compute the average time
books_df
#keep only books with an empty Series Name, i.e. books that do not belong to a series (.copy() avoids SettingWithCopyWarning later)
filtered_df = books_df[books_df['Series Name'].notna() & (books_df['Series Name'] == '')].copy()
#drop all rows whose date does not contain exactly two "-" separators, i.e. keep only dates made of year, month and day.
filtered_df = filtered_df[filtered_df['Original Publication'].str.count('-') == 2]
#there are 3 outlier rows with a different time format, so I drop them.
date_pattern = r'^\d{4}-\d{2}-\d{2}$'
filtered_df = filtered_df[filtered_df['Original Publication'].str.match(date_pattern)]
#now, since a lot (A LOT!) of my rows are different editions of the same book, I just need to keep one of them.
#So I drop duplicates and keep only one and see its original date
filtered_df = filtered_df.drop_duplicates(subset=['Title'])
#now I have my clean dataframe :)
filtered_df = filtered_df.drop('Serie Position', axis=1)
filtered_df = filtered_df.drop('Series Name', axis=1)
filtered_df
|   | Original Publication | Author Name | Title | Average Rating | Pages | Language | Format |
|---|---|---|---|---|---|---|---|
| 3416 | 1978-10-03 | Stephen King | The Stand | 4.34 | 1344.0 | eng | Hardcover |
| 3425 | 1975-10-17 | Stephen King | 'Salem's Lot | 4.25 | 594.0 | eng | Hardcover |
| 3535 | 1980-05-01 | James Patterson | Cradle and All | 3.86 | 384.0 | eng | Paperback |
| 4682 | 2002-06-10 | James Patterson | The Beach House | 3.85 | 356.0 | eng | Paperback |
| 6607 | 2006-10-24 | Stephen King | Lisey's Story | 3.68 | 513.0 | eng | Hardcover |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6504774 | 2015-11-03 | Stephen King | The Bazaar of Bad Dreams | 3.92 | 495.0 | eng | Hardcover |
| 6555083 | 2006-09-21 | John Green | The Abundance of Katherines | 3.57 | 215.0 | eng | Paperback |
| 6555457 | 2014-12-02 | Stephen King | Vertigo Essentials: American Vampire #1 | 3.88 | 64.0 | eng | Kindle Edition |
| 6767963 | 2015-06-30 | Stephen King | Drunken Fireworks | 3.53 | 2.0 | eng | Audiobook |
| 6865907 | 2015-11-10 | Colleen Hoover | November 9 | 4.34 | 320.0 | eng | Kindle Edition |
190 rows × 7 columns
### MY IDEA --> now I do again what I did before with the calculations
filtered_df['Original Publication'] = pd.to_datetime(filtered_df['Original Publication'])
# Sort the DataFrame by the "Original Publication" date
general_date_df = filtered_df.sort_values(by='Original Publication')
# Calculate the time gap between subsequent publications
general_date_df['Time Gap'] = general_date_df['Original Publication'].diff()
# Compute the average time gap
average_time_gap = general_date_df['Time Gap'].mean()
print(average_time_gap)
84 days 02:17:08.571428571
#MY IDEA --> here instead, I do what I did for the series, but instead of grouping by series, I group by author
# Convert the "Original Publication" column to datetime
filtered_df['Original Publication'] = pd.to_datetime(filtered_df['Original Publication'])
# Sort the DataFrame by the "Original Publication" date
author_date_df = filtered_df.sort_values(by='Original Publication')
# Calculate the time gap between subsequent publications
author_date_df['Time Gap'] = author_date_df.groupby('Author Name')['Original Publication'].diff()
# Compute the average time gap by author
average_time_gap_by_author = author_date_df.groupby('Author Name')['Time Gap'].mean()
print(average_time_gap_by_author)
Author Name
Bill Gates             1196 days 00:00:00
Cassandra Clare                        NaT
Colleen Hoover          231 days 00:00:00
James Patterson         282 days 21:03:40.408163264
John Green              306 days 22:24:00
Mindy Kaling           1332 days 08:00:00
Neil deGrasse Tyson     733 days 00:00:00
Rick Riordan           4128 days 00:00:00
Stephen King            148 days 21:24:42.352941176
Veronica Roth           357 days 00:00:00
Name: Time Gap, dtype: timedelta64[ns]
As before, the question is ambiguous, so I do not know whether to return the average over all books or the average per author. Nevertheless, here are both results. The overall average time (84 days) is unusually short, but it can be explained by the fact that we are pooling several authors who publish often. As before, most of the authors have a "human" publication pace, but three (Stephen King, James Patterson, Colleen Hoover) have very short gaps. In particular, Stephen King has the shortest time between publications.
For each of the authors, give a convenient plot showing how many books has the given author published UP TO a given year. Are these authors contemporary with each other? Can you notice a range of years where their production rate was higher?
###MY IDEA --> Now I clean the dataset again by dropping all duplicate books and all books with an incorrect date format
#I drop all books with the same title and, for the series, all books that share both the series name and the series position.
#In order to do so, I work in two separate dataframes and then merge them.
#Then, I drop all books with the same original publication date. I do this because if two rows share the original publication date,
#they are the same book in a different edition. I can do this because I only have 10 authors, so it is very unlikely that two of them published a book on the same year, month and day.
books_df
filtered_df = books_df.drop_duplicates(subset=['Title'])
#there are about 800 books with a wrong datetime format, so I drop them
date_pattern = r'^\d{4}-\d{2}-\d{2}$'
filtered_df = filtered_df[filtered_df['Original Publication'].str.match(date_pattern)]
#filter by position removing outliers
filtered_df = filtered_df[filtered_df['Serie Position'].str.contains('-') == False]
#final filtering
filtered_df = filtered_df.drop_duplicates(subset=['Original Publication'])
#MY IDEA --> I transform the date and sort the result; then I group for each author, their books.
#convert date
filtered_df['Original Publication'] = pd.to_datetime(filtered_df['Original Publication'])
#sort
filtered_df = filtered_df.sort_values(by='Original Publication')
#count
filtered_df['Books Published'] = filtered_df.groupby('Author Name')['Original Publication'].cumcount() + 1
#plot
authors = filtered_df['Author Name'].unique()
for author in authors:
    author_df = filtered_df[filtered_df['Author Name'] == author]
    plt.plot(author_df['Original Publication'], author_df['Books Published'], label=author)
plt.xlabel('Time')
plt.ylabel('Total Books Published')
plt.title('Total Books Published by Author')
plt.legend(loc='best')
plt.grid(True)
plt.show()
The authors are more or less all contemporary with each other, except for Stephen King and James Patterson. These two are contemporary with the others in the sense that they were still publishing when the others started, but as we can see, King and Patterson started much earlier and simply kept publishing after the others began.
df_rating_dist = df_lighter_books.select("rating_dist").toPandas() # convert the rating_dist column to a pandas dataframe
df_rating_dist.head() # print the first 5 rows of the dataframe
|   | rating_5 | rating_4 | rating_3 | rating_2 | rating_1 | total |
|---|---|---|---|---|---|---|
| 0 | 1674064 | 664833 | 231195 | 41699 | 16215 | 2628006 |
| 1 | 4801606 | 1681521 | 623286 | 145898 | 125040 | 7377351 |
| 2 | 1690166 | 781011 | 313727 | 54687 | 15453 | 2855044 |
| 3 | 1994597 | 696545 | 212678 | 28915 | 13959 | 2946694 |
| 4 | 1808039 | 663849 | 193604 | 27759 | 12425 | 2705676 |
# Preprocess df_rating_dist with our helper function
df_rating_dist = func.preprocess(df_rating_dist)
df_rating_dist.head() # print the first 5 rows of the dataframe
|   | rating_5 | rating_4 | rating_3 | rating_2 | rating_1 | total |
|---|---|---|---|---|---|---|
| 0 | 1674064 | 664833 | 231195 | 41699 | 16215 | 2628006 |
| 1 | 4801606 | 1681521 | 623286 | 145898 | 125040 | 7377351 |
| 2 | 1690166 | 781011 | 313727 | 54687 | 15453 | 2855044 |
| 3 | 1994597 | 696545 | 212678 | 28915 | 13959 | 2946694 |
| 4 | 1808039 | 663849 | 193604 | 27759 | 12425 | 2705676 |
df_rating_dist.describe() # print the summary statistics of the dataframe
|   | rating_5 | rating_4 | rating_3 | rating_2 | rating_1 | total |
|---|---|---|---|---|---|---|
| count | 7.027431e+06 | 7.027431e+06 | 7.027431e+06 | 7.027431e+06 | 7.027431e+06 | 7.027431e+06 |
| mean | 8.006048e+03 | 6.271244e+03 | 3.635023e+03 | 1.049196e+03 | 4.970985e+02 | 1.945861e+04 |
| std | 7.733876e+04 | 4.876856e+04 | 2.701399e+04 | 8.667411e+03 | 5.197083e+03 | 1.609008e+05 |
| min | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 |
| 50% | 1.300000e+01 | 1.400000e+01 | 1.100000e+01 | 3.000000e+00 | 1.000000e+00 | 4.500000e+01 |
| 75% | 2.320000e+02 | 2.620000e+02 | 1.830000e+02 | 4.600000e+01 | 1.500000e+01 | 7.710000e+02 |
| max | 4.816896e+06 | 1.993815e+06 | 1.051455e+06 | 5.605970e+05 | 5.625020e+05 | 7.400639e+06 |
for col in df_rating_dist.columns:  # for each column in the dataframe
    df_rating_dist.loc[df_rating_dist[col] < 0, col] = 0  # replace negative values with 0
df_rating_dist.describe() # print the summary statistics of the dataframe
|   | rating_5 | rating_4 | rating_3 | rating_2 | rating_1 | total |
|---|---|---|---|---|---|---|
| count | 7.027431e+06 | 7.027431e+06 | 7.027431e+06 | 7.027431e+06 | 7.027431e+06 | 7.027431e+06 |
| mean | 8.006048e+03 | 6.271244e+03 | 3.635023e+03 | 1.049196e+03 | 4.970985e+02 | 1.945861e+04 |
| std | 7.733876e+04 | 4.876856e+04 | 2.701399e+04 | 8.667411e+03 | 5.197083e+03 | 1.609008e+05 |
| min | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 |
| 50% | 1.300000e+01 | 1.400000e+01 | 1.100000e+01 | 3.000000e+00 | 1.000000e+00 | 4.500000e+01 |
| 75% | 2.320000e+02 | 2.620000e+02 | 1.830000e+02 | 4.600000e+01 | 1.500000e+01 | 7.710000e+02 |
| max | 4.816896e+06 | 1.993815e+06 | 1.051455e+06 | 5.605970e+05 | 5.625020e+05 | 7.400639e+06 |
df_rating_dist['percentage_4_above'] = (df_rating_dist['rating_5'] + df_rating_dist['rating_4']) / df_rating_dist['total'] # add a column with the percentage of ratings 4 and above
num_books_over_30 = df_rating_dist[df_rating_dist['percentage_4_above'] > 0.3].shape[0] # count the number of books with percentage of ratings 4 and above greater than 30%
total_books = df_rating_dist.shape[0] # count the total number of books
probability = num_books_over_30 / total_books # calculate the probability
probability # print the probability
0.805691866629498
The probability that more than 30% of a book's ratings are 4 stars or higher is 0.8057, i.e. about 81%.
df_lighter_books = df_lighter_books.filter(trim(col('author_id')) != '') # removes rows with empty author_id
df_lighter_books = df_lighter_books.filter(trim(col('original_publication_date')) != '') # removes rows with empty original_publication_date
df_lighter_books = df_lighter_books.dropna(subset=['original_publication_date', 'author_id']) # removes rows with null values in original_publication_date and author_id
df_within_two_years = df_lighter_books.select("author_id", "original_publication_date").toPandas() # convert the author_id and original_publication_date columns to a pandas dataframe
df_within_two_years.info() # print a concise summary of the dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6513825 entries, 0 to 6611684
Data columns (total 2 columns):
 #   Column                     Dtype
---  ------                     -----
 0   author_id                  int64
 1   original_publication_date  datetime64[ns]
dtypes: datetime64[ns](1), int64(1)
memory usage: 149.1 MB
# Convert the original_publication_date column to datetime type
df_within_two_years['original_publication_date'] = pd.to_datetime(df_within_two_years['original_publication_date'], errors='coerce')
df_within_two_years = df_within_two_years.dropna(subset=['original_publication_date']) # removes rows with null values in original_publication_date
df_within_two_years.head() # print the first 5 rows of the dataframe
|   | author_id | original_publication_date | time_diff | time_diff_years | published_within_two_years |
|---|---|---|---|---|---|
| 6161803 | 16244 | 1678-01-01 | 0 days | 0.0 | True |
| 18772 | 16244 | 1678-01-01 | 0 days | 0.0 | True |
| 5337206 | 18457 | 1678-01-01 | 0 days | 0.0 | True |
| 6085994 | 401826 | 1678-01-01 | 0 days | 0.0 | True |
| 5047620 | 18457 | 1678-01-01 | 0 days | 0.0 | True |
df_within_two_years = df_within_two_years.sort_values('original_publication_date', ascending=True) # sort the dataframe by original_publication_date
# Calculate the difference between each date and the previous one
df_within_two_years['time_diff'] = df_within_two_years['original_publication_date'].diff()
df_within_two_years = df_within_two_years.dropna(subset=['time_diff']) # removes rows with null values in time_diff
# Convert the difference to years
df_within_two_years['time_diff_years'] = df_within_two_years['time_diff'].dt.days / 365.25
# Find the instances where the difference is less than or equal to 2
df_within_two_years['published_within_two_years'] = df_within_two_years['time_diff_years'] <= 2
# Calculate the probability
probability = df_within_two_years['published_within_two_years'].mean()
probability # print the probability
0.9999976972052054
The probability that an author publishes a new book within two years of their last work comes out as 0.9999976972052054. Note, however, that the gaps here are computed on the globally sorted publication dates of all authors pooled together, not per author; with millions of rows, almost every consecutive gap is tiny, so this estimate should be read as an upper bound.
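A per-author grouping would give a more faithful estimate. A minimal sketch on toy data (the column names match the ones used above):

```python
import pandas as pd

# toy stand-in for df_within_two_years, with the same column names
toy = pd.DataFrame({
    "author_id": [1, 1, 1, 2, 2],
    "original_publication_date": pd.to_datetime(
        ["2000-01-01", "2001-01-01", "2005-01-01", "2000-06-01", "2001-06-01"]),
})
toy = toy.sort_values(["author_id", "original_publication_date"])
# diff within each author, not across the pooled dates
toy["time_diff_years"] = (
    toy.groupby("author_id")["original_publication_date"].diff().dt.days / 365.25
)
within_two = (toy["time_diff_years"].dropna() <= 2).mean()
print(within_two)  # 2 of the 3 author-level gaps are within two years
```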
df_list_x = df_list.filter(df_list['title'] == "The Worst Books of All Time") # filter the dataframe to keep only the rows with title "The Worst Books of All Time"
df_list_x.show() # print the dataframe
+--------------------+--------------------+--------------+--------------------+--------------------+---+---------+------------+---------+---------+----------+--------------------+--------------------+
| books| created_by| created_date| description| description_html| id|num_books|num_comments|num_likes|num_pages|num_voters| tags| title|
+--------------------+--------------------+--------------+--------------------+--------------------+---+---------+------------+---------+---------+----------+--------------------+--------------------+
|[{Stephenie Meyer...|{73, Michael Econ...|May 20th, 2008|What do you think...|\n What do y...| 2| 7395| 2570| 175| 74| 18260|[abominable, abom...|The Worst Books o...|
+--------------------+--------------------+--------------+--------------------+--------------------+---+---------+------------+---------+---------+----------+--------------------+--------------------+
# Get the list of book IDs
df_flattened = df_list_x.select(explode(df_list_x.books).alias("books"))
df_book_ids = df_flattened.select("books.book_id")
df_book_ids.sort(col("book_id").desc()).show() # print the book IDs
+-------+
|book_id|
+-------+
|   9998|
|9996645|
|  99944|
| 998133|
| 997668|
|9972882|
|9972053|
|9969571|
|9961659|
|  99610|
|9960089|
|   9957|
|  99561|
|9954020|
|  99452|
|   9943|
| 993849|
|  99383|
| 993455|
|  99329|
+-------+
only showing top 20 rows
# Convert df_book_ids to a list
book_ids_list = [row['book_id'] for row in df_book_ids.collect()]
# Select rows from df_lighter_books where id is in book_ids_list
df_selected = df_lighter_books.filter(df_lighter_books.id.isin(book_ids_list))
# Select id and num_pages columns
df_selected = df_selected.select('id', 'num_pages')
# sort the dataframe by id in descending order
df_selected.sort(col("id").desc()).show()
+--------+---------+
|      id|num_pages|
+--------+---------+
|25494343|      698|
|25451852|      712|
|25448677|      250|
|25437276|      392|
|25436458|      600|
|25434341|      272|
|25425999|      128|
|25410020|     NULL|
|25389154|     NULL|
|25324111|      486|
|25304748|      275|
|25278664|      336|
|25263605|      256|
|25261897|      368|
|25257622|      288|
|25251380|     NULL|
|25199464|     NULL|
|25161131|      422|
|25147910|       14|
|25143499|      348|
+--------+---------+
only showing top 20 rows
df_selected.count() # count the number of rows
6228
df_book_ids.count() # count the number of rows
7393
# Number of books in list with >700 pages
num_books_over_700 = df_selected.filter(df_selected['num_pages'] > 700).count()
num_books_over_700 # print the number of books in list with >700 pages
220
# Total number of books with >700 pages in df_lighter_books
total_books_over_700 = df_lighter_books.filter(df_lighter_books['num_pages'] > 700).count()
total_books_over_700 # print the total number of books with >700 pages in df_lighter_books
111120
# P(book in list | book has >700 pages) = (Number of books in list with >700 pages) / (Total number of books with >700 pages)
probability = num_books_over_700 / total_books_over_700
# Print the probability
probability
0.0019798416126709864
The probability of a book being included in this list, given that it has more than 700 pages, is 0.00198, i.e. about 0.2%.
In probability theory, two events are said to be independent if the probability of one event occurring does not affect the probability of the other event. Formally, events X and Y are independent if and only if P(X ∩ Y) = P(X)P(Y).
P(X): The probability of a book being included in "The Worst Books of All Time" list, which can be calculated by dividing the number of books in the list by the total number of books considered.
P(Y): The probability of a book having more than 700 pages, which can be calculated by dividing the number of books with more than 700 pages by the total number of books considered.
P(X ∩ Y): The probability of both events happening, i.e., a book being in the list and having more than 700 pages. This can be calculated by dividing the number of books in the list that have more than 700 pages by the total number of books considered.
If P(X ∩ Y) equals P(X)P(Y), then the events X and Y are independent. If not, they are not independent, meaning the occurrence of one event does influence the probability of the other event.
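The definitions above can be packaged into a small generic helper; a sketch where the function name and the relative tolerance are our own choices (exact float equality would essentially never hold for probabilities estimated from counts):

```python
import math

def are_independent(n_x, n_y, n_xy, n_total, rel_tol=0.05):
    """Check P(X and Y) ~= P(X) * P(Y) from raw counts, up to a relative tolerance."""
    p_x, p_y, p_xy = n_x / n_total, n_y / n_total, n_xy / n_total
    return math.isclose(p_xy, p_x * p_y, rel_tol=rel_tol)

# toy counts where X and Y are exactly independent: P(X)=0.5, P(Y)=0.2, P(X,Y)=0.1
print(are_independent(50, 20, 10, 100))  # True
# toy counts where P(X,Y) is twice P(X)*P(Y)
print(are_independent(50, 20, 20, 100))  # False
```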
# P(X): The probability of a book being included in "The Worst Books of All Time" list, which can be calculated by dividing the number of books in the list by the total number of books considered.
px = df_selected.count() / df_lighter_books.count()
# P(Y): The probability of a book having more than 700 pages, which can be calculated by dividing the number of books with more than 700 pages by the total number of books considered.
py = df_lighter_books.filter(df_lighter_books['num_pages'] > 700).count() / df_lighter_books.count()
# P(X,Y): The probability of a book being included in "The Worst Books of All Time" list and having more than 700 pages, which can be calculated by dividing the number of books in the list with more than 700 pages by the total number of books considered.
pxy = df_selected.filter(df_selected['num_pages'] > 700).count() / df_lighter_books.count()
px # print P(X)
0.0008862413590400248
py # print P(Y)
0.015812321743180403
print(pxy) # print P(X,Y)
3.130589258009079e-05
print(px * py) # print P(X) * P(Y)
1.4013533511254334e-05
# check if P(X,Y) = P(X) * P(Y); an exact float comparison is very strict, but here the two
# values differ by more than a factor of two, so the conclusion does not depend on the tolerance
pxy == px * py
False
Therefore we can conclude that the events X=’Being Included in The Worst Books of All Time list’ and Y=’Having more than 700 pages’ are not independent.
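The degree of dependence can also be quantified as the ratio P(X ∩ Y) / (P(X)·P(Y)), sometimes called lift, where a value of 1 would mean independence. Using the values printed above:

```python
# ratio P(X,Y) / (P(X) * P(Y)); a lift of 1 would mean independence
pxy = 3.130589258009079e-05    # from the output above
px_py = 1.4013533511254334e-05  # P(X) * P(Y) from the output above
lift = pxy / px_py
print(round(lift, 2))  # 2.23
```

A book with more than 700 pages is therefore roughly twice as likely to appear in the list as independence would predict.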
The research question asks whether there is a negative correlation between the number of pages a book has and its average rating. Our approach is to draw a scatterplot of the two variables and check whether a regression line with a negative slope can be fitted, treating both the number of pages and the average rating as continuous quantitative variables. We also calculate the correlation between the two variables and check that the results we obtain graphically and with the formulas are consistent with our hypotheses. Finally, we can run a Pearson test to see whether the correlation is significant.
df_lighter_books = df_lighter_books.filter(trim(col("num_pages")) != "") # removes rows with empty num_pages
df_lighter_books = df_lighter_books.filter(trim(col("average_rating")) != "")
df_lighter_books = df_lighter_books.dropna(subset=["num_pages","average_rating"])
ratings_count = df_lighter_books.select("ratings_count").toPandas()
pages_and_ratings = df_lighter_books.select("num_pages","average_rating").toPandas() #takes 5 minutes to run
# Let's look at our quantiles to remove some outliers before performing the scatterplot
np.max(pages_and_ratings["num_pages"])
q_pages = np.quantile(pages_and_ratings["num_pages"], q = 0.999) #we use this quantile to exclude numbers of pages that are too big
q_ratings = np.quantile(ratings_count["ratings_count"], q = 0.2)
q_min_pages = np.quantile(pages_and_ratings["num_pages"], q = 0.2)
filtered_pages_and_ratings = df_lighter_books.filter((df_lighter_books["ratings_count"] >= q_ratings) & (df_lighter_books["num_pages"] >= q_min_pages) & (df_lighter_books["num_pages"] <= q_pages)).select("num_pages", "average_rating").toPandas()
plt.scatter(filtered_pages_and_ratings["num_pages"], filtered_pages_and_ratings["average_rating"], alpha=0.1, edgecolors = 'lightsteelblue')
slope, intercept = np.polyfit(filtered_pages_and_ratings["num_pages"], filtered_pages_and_ratings["average_rating"], 1)
plt.plot(filtered_pages_and_ratings["num_pages"], slope*filtered_pages_and_ratings["num_pages"]+intercept, color = "green")
plt.title("Scatterplot of Average Ratings and Number of Pages")
plt.xlabel("Number of pages")
plt.ylabel("Average Rating")
plt.show()
rho = stats.pearsonr(filtered_pages_and_ratings["num_pages"], filtered_pages_and_ratings["average_rating"])
r = rho[0]
# What do we expect? From the plot shown above, any linear relationship looks weak... so we expect to see a rho that is not too high.
# At the same time we know that our sample size is very big... it is almost an asymptotic! So we expect to see a p-value really really low...
p_value = rho[1]
# And in fact:
print('The Pearson correlation coefficient is', r)
print('The corresponding p-value is', p_value)
The Pearson correlation coefficient is 0.1978015443204692
The corresponding p-value is 0.0
The scatterplot did not yield the result we hoped for. So, to test whether ratings become worse for books with many pages, we turn the average rating into a dichotomous variable and fit a logistic regression.
threshold = 3.5 # Under 3.5 is a bad rating
# New column "rating_binary"
df_lighter_books = df_lighter_books.withColumn("rating_binary", when(df_lighter_books["average_rating"] >= threshold, 1).otherwise(0))
pages_and_Binary = df_lighter_books.select("num_pages", "rating_binary").toPandas()
pages_and_binary_filtered = pages_and_Binary[pages_and_Binary["num_pages"] <= np.quantile(pages_and_Binary["num_pages"], q = 0.999)] #filtering the data
X = pages_and_binary_filtered["num_pages"].values.reshape(-1, 1)
y = pages_and_binary_filtered["rating_binary"].values
# Create and fit a logistic regression model
logistic_reg = LogisticRegression()
logistic_reg.fit(X, y)
# Plot the logistic curve
y_prob = logistic_reg.predict_proba(X)[:, 1]
intercept = logistic_reg.intercept_[0]
coefficient = logistic_reg.coef_[0][0]
print("The coefficient of the logistic regression is:", coefficient)
The coefficient of the logistic regression is: 0.0023039983447515986
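One way to read this coefficient is as the change in log-odds per additional page, so exponentiating gives an odds ratio. As a sketch, for 100 extra pages (using the value printed above):

```python
import math

coefficient = 0.0023039983447515986  # printed above
# odds ratio for 100 extra pages: exp(beta * 100)
odds_ratio_100 = math.exp(coefficient * 100)
print(round(odds_ratio_100, 2))  # 1.26
```

That is, every 100 extra pages multiply the odds of a "good" rating (>= 3.5) by roughly 1.26, consistent with the positive, weak relationship seen in the scatterplot.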
plt.rcParams['agg.path.chunksize'] = 10000 #found online after an error; needed for rendering a large number of points
X_curve = np.linspace(min(X), max(X), 300).reshape(-1, 1) #x data
y_prob_curve = logistic_reg.predict_proba(X_curve)[:, 1] #y predictions
plt.scatter(X, y, alpha=0.6)
plt.plot(X_curve, y_prob_curve, color="green")
plt.title("Plot of the logistic curve")
Text(0.5, 1.0, 'Plot of the logistic curve')
After these attempts we can conclude that there is no negative correlation or negative relationship between the two variables. Although the linear regression summarized the behavior of the data reasonably well, the logistic regression produces a curve that seems to follow the pattern of the scatterplot better.
We therefore conclude that as the number of pages increases, the ratings get higher. Perhaps, after reading a really long book, one tends to rate it well; if readers do not like a long book, they may stop reading before finishing it and not rate it at all.
# From the pie-chart of the languages we have seen that most of the books are english book but if we look at the other categories we can see that there are also "en-US" and "en-GB"
# and we will put them together with "eng".
# So we want to construct a new column with English and Non-English books
df_lighter_books = df_lighter_books.withColumn("language_standard", when(col("language").isin("en-US", "en-GB", "eng"), "eng").otherwise("non-eng"))
df_languages_standard = df_lighter_books.select("title","language_standard","average_rating").toPandas()
# Now our data is ready
df_languages_standard["language_standard"].value_counts()
# So our population is composed by 812523 non-eng books and 770773 eng books.
non-eng    812523
eng        770773
Name: language_standard, dtype: int64
To compare the distribution of the average rating for English and non-English books, a t-test can be used. It is a parametric test used to determine whether there is a statistically significant difference between the means of two distributions. The t-test is based on three assumptions:
However, the t-test is reasonably robust to departures from normality, and it is the standard choice for small samples.
In our case the samples are independent because they are drawn from two different subpopulations, and we can set the normality assumption aside for the t-test and go on to compute the variance. np.var calculates the population variance; here we do not need the sample variance because we have the whole population available. In practice, however, for reasons of time and computation one often works with samples from the population and uses the sample variance.
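The distinction between population and sample variance corresponds to NumPy's `ddof` argument (`ddof=0` is the default, dividing by n; `ddof=1` divides by n-1). A quick sketch on toy numbers:

```python
import numpy as np

data = [1, 2, 3, 4]
pop_var = np.var(data)             # divides by n   -> 1.25
sample_var = np.var(data, ddof=1)  # divides by n-1 -> ~1.667
print(pop_var, sample_var)
```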
random.seed(42)
# Sample the data
eng_ratings = random.sample(df_languages_standard[df_languages_standard["language_standard"] == "eng"]["average_rating"].tolist(), 1000)
non_eng_ratings = random.sample(df_languages_standard[df_languages_standard["language_standard"] == "non-eng"]["average_rating"].tolist(), 1000)
print(np.var(eng_ratings))
print(np.var(non_eng_ratings))
# our variances do not differ much, and so we can use the equal_var = True argument of the t-test
t_stat, p_value = stats.ttest_ind(eng_ratings, non_eng_ratings, equal_var=True) # Assuming equal variances and the default method is two-sided
print("The value of the t statistics is:", t_stat, "and the value of the p-value is: ", p_value)
0.7931538774999999
0.5778219951
The value of the t statistics is: -0.4610581512906193 and the value of the p-value is:  0.6448070679699869
From the result of the t-test, the p-value is greater than 0.05, so we fail to reject the null hypothesis. Therefore, there does not seem to be a significant difference between the means of the two distributions.
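With samples this size, statistical significance alone says little about practical relevance, so an effect-size measure such as Cohen's d usefully complements the p-value. A sketch with a helper of our own, shown on toy groups:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference of means over the pooled (ddof=1) standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# toy groups whose means differ by exactly one pooled standard deviation
print(cohens_d([1, 2, 3], [2, 3, 4]))  # -1.0
```

Applied to `eng_ratings` and `non_eng_ratings`, a |d| well below 0.2 would confirm that any difference between the groups is negligible in practice.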
We do not stop our analysis here, because we have a lot of data available and the t-test is usually used on small samples. An alternative to the t-test is the z-test, which is precisely the test used when the sample size is large. The z-test assumes that:
Unlike the t-test, the literature does not give us confidence that the z-test performs well when the data are not normal, so we have to verify normality, either with a test or graphically.
# Note: 'norm' here is the standard normal N(0, 1); strictly, the samples should be standardized before this comparison
stat, p = stats.kstest(eng_ratings, 'norm')
print("Kolmogorov-Smirnov statistic for eng_rating:", stat)
print("P-Value for eng_rating:", p)
stat_n, p_n = stats.kstest(non_eng_ratings, 'norm')
print("Kolmogorov-Smirnov statistic for non_eng_rating:", stat_n)
print("P-Value for non_eng_rating:", p_n)
# Both p-values are under 0.05, so our samples are not normally distributed...
# We can check it also graphically
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
axes[0].hist(eng_ratings, bins=20, alpha=0.8, label='eng', color = our_colors[0] )
axes[0].set_title('Ratings distributions for "eng" books')
axes[0].set_xlabel('Average_ratings')
axes[0].set_ylabel('Frequency')
# Plot of the data distribution for the "non-eng" group
axes[1].hist(non_eng_ratings, bins=20, alpha=0.8, label='non-eng', color = our_colors[2])
axes[1].set_title('Ratings distributions for "non-eng" books')
axes[1].set_xlabel('Average_ratings')
axes[1].set_ylabel('Frequency')
Kolmogorov-Smirnov statistic for eng_rating: 0.9472074376523145
P-Value for eng_rating: 0.0
Kolmogorov-Smirnov statistic for non_eng_rating: 0.9609754114572416
P-Value for non_eng_rating: 0.0
Text(0, 0.5, 'Frequency')
As we can see both graphically and with the test, we cannot assume that the sample data are normal. Therefore, we are not going to use the Z-test.
From the graph we also see that the shape of the two distributions is almost the same, so the t-test we performed may still be reliable; we nevertheless choose to also run a nonparametric test such as the Wilcoxon-Mann-Whitney test. With a nonparametric test we make no assumptions about the distribution of the data; we still get a test statistic and a p-value, but, for example, we will not have confidence intervals.
We then move on to the Wilcoxon-Mann-Whitney test, a non-parametric statistical test used to compare the medians of two groups.
The test assumptions are:
The null hypothesis is that the distribution of non-english average ratings books is the same as the distribution of english average ratings book.
statistic, p_value = mannwhitneyu(eng_ratings, non_eng_ratings)
print("Statistic:", statistic)
print("p-value:", p_value)
Statistic: 512281.0 p-value: 0.34154855244878934
So the nonparametric test also gives a p-value greater than 0.05; we fail to reject the null hypothesis and conclude that there is no statistically significant difference between the two groups.
Therefore, going beyond statistical significance, we can conclude that there is no clear evidence that language affects the distribution of average ratings: the fact that a book is written in English or in another language does not significantly change the rating readers will give it.
Should one wish to make a more accurate analysis of the average rating of books written in English or another language, one would also need to consider other factors besides the distribution of the average rating itself, such as the genre of the book, the writer, and even the audience being considered.
eng = df_languages_standard[df_languages_standard["language_standard"] == "eng"]["average_rating"]
non_eng = df_languages_standard[df_languages_standard["language_standard"] == "non-eng"]["average_rating"]
fig, ax = plt.subplots(figsize=(10, 5))
bplot1 = ax.boxplot(eng, positions=[1], labels=["eng"], patch_artist=True, showmeans=True, meanline = True)
bplot2 = ax.boxplot(non_eng, positions=[2], labels=["non-eng"], patch_artist=True, showmeans=True, meanline = True)
ax.set_xlabel('Language Groups')
ax.set_ylabel('Average Rating')
for box, color in zip(bplot1['boxes'], ['dodgerblue']):
    box.set_facecolor(color)
for box, color in zip(bplot2['boxes'], ['deepskyblue']):
    box.set_facecolor(color)
plt.title("Boxplots of Average Rating for English and Non English books")
plt.show()
print("English books:", "\n", "Mean:" ,np.mean(eng), "\n","Mode:", mode(eng),"\n", "Median:" ,np.median(eng), "\n", "First Quartile: ", np.quantile(eng,q=0.25), "\n", "Third Quartile: ", np.quantile(eng,q=0.75))
print()
print("Non English books:", "\n", "Mean:" ,np.mean(non_eng), "\n","Mode:", mode(non_eng),"\n", "Median:" ,np.median(non_eng), "\n", "First Quartile: ", np.quantile(non_eng,q=0.25), "\n", "Third Quartile: ", np.quantile(non_eng,q=0.75))
English books:
 Mean: 3.7536994679367313
 Mode: 4.0
 Median: 3.91
 First Quartile: 3.64
 Third Quartile: 4.13

Non English books:
 Mean: 3.7176398206573826
 Mode: 4.0
 Median: 3.88
 First Quartile: 3.6
 Third Quartile: 4.1
We modified the boxplot to also show the mean, via the showmeans argument; the mean is the dotted line inside the box, and in the classic boxplot it is not represented.
The other line inside the box represents the median, i.e. the second quartile.
The first and third quartiles are the extremes of the box, and the size of the box represents the interquartile range.
All the dots that we see are the values that are considered outliers, in fact if we look again at the book sample distribution graph we will see that most of the data are concentrated between 3 and 5 and therefore the boxplot considers all those it finds below the value 3 to be outlier data.
The other two black horizontal lines represent the minimum and maximum value of the distribution.
Having also displayed the mean in the boxplot we can also have additional information regarding the skewness of the distribution. When the mean is smaller than the median there is negative skewness i.e. the distribution has a long tail on the left, vice versa if the mean is larger than the median.
If the mean and median do not coincide, it is also concluded that you do not have a normal distribution.
The values for the two distributions are very similar, again confirming the results of the previous question.
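The mean–median rule for skewness described above can be checked numerically. A minimal sketch on a made-up left-skewed sample (the array below is illustrative, not our dataset):

```python
import numpy as np

# Illustrative ratings with a long left tail (the 1.0 value); not our dataset
ratings = np.array([1.0, 2.5, 3.4, 3.6, 3.8, 3.9, 3.9, 4.0, 4.1, 4.2])

mean, median = np.mean(ratings), np.median(ratings)
# Moment-based sample skewness: negative for a long left tail
skewness = np.mean(((ratings - mean) / np.std(ratings)) ** 3)

print(mean < median)   # the mean is dragged left by the tail
print(skewness < 0)    # negative skewness confirms the left tail
```

The same check on our eng and non_eng series would confirm the slight left skew visible in the boxplots.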
# For this task we use the authors dataset and we clean a bit the data
df_lighter_authors = df_lighter_authors.filter(trim(col("name")) != "") # removes rows with empty name
df_lighter_authors = df_lighter_authors.filter(trim(col("fans_count")) != "") # removes rows with empty fans_count
df_lighter_authors = df_lighter_authors.filter((col("text_reviews_count") >= 0) & (col("fans_count") >= 0)) # removes negative values
df_lighter_authors = df_lighter_authors.filter(trim(col("text_reviews_count")) != "") # removes rows with empty text_review_count
df_lighter_authors = df_lighter_authors.dropna(subset=["name","fans_count","text_reviews_count"])
fans_and_reviews = df_lighter_authors.select("name","fans_count","text_reviews_count").orderBy("fans_count", ascending=False).toPandas()
fans_and_reviews # well, Stephen King fans are not lazy!
| | name | fans_count | text_reviews_count |
|---|---|---|---|
| 0 | Stephen King | 766035 | 608956 |
| 1 | Veronica Roth | 455358 | 277033 |
| 2 | Colleen Hoover | 356633 | 240302 |
| 3 | Rick Riordan | 339346 | 354145 |
| 4 | James Patterson | 317040 | 333220 |
| ... | ... | ... | ... |
| 248369 | Peter Moore | 0 | 0 |
| 248370 | Luise Zschöttge | 0 | 0 |
| 248371 | Manuel Martínez Sospedra | 0 | 1 |
| 248372 | Evelyn Wylder | 0 | 3 |
| 248373 | Mukunda Das | 0 | 0 |
248374 rows × 3 columns
As with the first question, to investigate whether there is some sort of relationship between fans_count and text_reviews_count we can either rely on a correlation test or on a predictive model such as linear regression.
We can do both and see if the results are consistent.
np.max(fans_and_reviews["fans_count"]) #maximum level = 766035
fans_99 = np.quantile(fans_and_reviews["fans_count"], q = 0.99) # 99-th percentile = 1620
# we need to trim the data
np.max(fans_and_reviews["text_reviews_count"]) #maximum level = 608956
text_99 = np.quantile(fans_and_reviews["text_reviews_count"], q = 0.99) # 99-th percentile = 6538
# Filtering the data according to the quantiles of the distributions
filtered_fans_and_reviews = fans_and_reviews[(fans_and_reviews["fans_count"] <= fans_99) & (fans_and_reviews["text_reviews_count"] <= text_99)]
plt.scatter(filtered_fans_and_reviews["fans_count"], filtered_fans_and_reviews["text_reviews_count"], alpha=0.1)
slope, intercept = np.polyfit(filtered_fans_and_reviews["fans_count"], filtered_fans_and_reviews["text_reviews_count"], 1)
plt.plot(filtered_fans_and_reviews["fans_count"], slope*filtered_fans_and_reviews["fans_count"]+intercept, color = "green")
plt.title("Scatterplot of Fans Count and Text Reviews Count")
plt.xlabel("Number of Fans")
plt.ylabel("Number of text reviews")
plt.show()
# Well... we have so much data that fitting a single line is not very helpful: the scatter does not show a clear linear relation
# We can still try to fit a predictive linear model
# Set X and Y and split the data
X = filtered_fans_and_reviews[["fans_count"]]
y = filtered_fans_and_reviews["text_reviews_count"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the model
model_lm = LinearRegression()
# Train the model
model_lm.fit(X_train, y_train)
# Predictions
y_pred = model_lm.predict(X_test)
# Results
print("The R^2 value is:", round(r2_score(y_test, y_pred),3), "and the value of the coefficient is:", round(model_lm.coef_[0],3))
The R^2 value is: 0.499 and the value of the coefficient is: 2.735
To understand whether the model explains enough variance in the data, we can use R squared.
It is a measure that is the proportion of the variation in the dependent variable (in this case text_reviews_count) that is predictable from the independent variable (in this case fans_count).
Its values usually range from 0 to 1, where 1 corresponds to perfect predictions. In our case we obtained an R^2 of about 0.5, so the linear regression model explains only about half of the variance in the data.
The coefficient we obtain is positive, so as the variable "fans_count" increases by one unit we expect the variable "text_reviews_count" to increase by about 2.7 reviews.
In this case we do not see lazy behavior from fans: according to our model, as the number of fans increases, so does the number of reviews.
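As background for interpreting the score above, R^2 can be computed by hand as 1 − SS_res/SS_tot and matches sklearn's r2_score; a small sketch on synthetic data (the slope and noise level are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
x = rng.uniform(0, 100, size=200)
y = 2.7 * x + rng.normal(0, 30, size=200)   # illustrative slope and noise

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))

ss_res = np.sum((y - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)        # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(y, y_pred)))  # the two definitions agree
```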
Let us also look at the correlation value and whether it is statistically significant.
We can use the Pearson correlation, but it assumes: 1) the data are approximately normally distributed; 2) the relationship between the variables is linear; 3) there are no extreme outliers.
In our case we are assuming 2), and we handled 3) by trimming at the 99th percentile; since we are not sure about the normality of our data, we also compute the Spearman correlation, which applies the same formula as the Pearson correlation to the ranks of the variables and, being non-parametric, does not require normality.
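Spearman's rank correlation is, as noted, Pearson's correlation applied to the ranks of the data; this identity can be verified directly (the synthetic data below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(size=100)              # deliberately non-normal data
y = x ** 2 + rng.normal(0, 0.1, size=100)  # monotone but non-linear relation

# Spearman's rho is Pearson's r computed on the ranks of the data
rho_spearman, _ = stats.spearmanr(x, y)
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(np.isclose(rho_spearman, r_on_ranks))  # same coefficient
```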
rho = stats.pearsonr(filtered_fans_and_reviews["fans_count"], filtered_fans_and_reviews["text_reviews_count"])
r = rho[0]
p_value = rho[1]
print('The Pearson correlation coefficient is', r)
print('The corresponding p-value is', p_value)
# The correlation is about 0.71, so there is a positive correlation between the variables and the coefficient is significant
The Pearson correlation coefficient is 0.7099858362419893 The corresponding p-value is 0.0
rho = stats.spearmanr(filtered_fans_and_reviews["fans_count"], filtered_fans_and_reviews["text_reviews_count"])
r = rho[0]
p_value = rho[1]
print('The Spearman correlation coefficient is', r)
print('The corresponding p-value is', p_value)
The Spearman correlation coefficient is 0.7656527137769905 The corresponding p-value is 0.0
With both Spearman's and Pearson's correlations we obtained a positive correlation between the two variables. We therefore conclude that authors with more fans do not have lazy fans: their fans write a large number of reviews.
However, it must be remembered that correlation does not imply causality.
We can then also do a statistical test by taking the 1000 most popular authors, and among the remaining authors another 1000 are sampled by simple random sampling without repetition.
best_1000 = filtered_fans_and_reviews.head(1000) #Because the data is already ordered
remaining_rows = filtered_fans_and_reviews.drop(best_1000.index)
others_1000 = remaining_rows.sample(n=1000, random_state=42)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
axes[0].hist(best_1000["text_reviews_count"], bins=20, alpha=0.8, label='best', color = our_colors[0])
axes[0].set_title("Number of text reviews for Popular Authors")
axes[0].set_xlabel("Text_Reviews_count")
axes[0].set_ylabel('Frequency')
# Plot of the distribution for the less popular authors group
axes[1].hist(others_1000["text_reviews_count"], bins=20, alpha=0.8, label='other_authors', color = our_colors[2])
axes[1].set_title("Number of text reviews for not very Popular Authors")
axes[1].set_xlabel("Text_Reviews_count")
axes[1].set_ylabel('Frequency')
# Well the distributions are very different, what about their means? Let's do the test.
We then run the test to compare the two groups. For the standard two-sided test, the null hypothesis
H0 is that there is no significant difference in the number of reviews ("text_reviews_count") between the "best_1000" group and the "others_1000" group,
while the alternative hypothesis
H1 is that there is a significant difference in the number of reviews ("text_reviews_count") between the "best_1000" and "others_1000" groups.
statistic, p_value = mannwhitneyu(best_1000["text_reviews_count"], others_1000["text_reviews_count"], alternative="two-sided")
print("Statistic:", statistic)
print("p-value:", p_value)
Statistic: 967271.0 p-value: 5.829949649923864e-287
We then also run the one-sided ("greater") test, for which our hypotheses are:
H0: There is no significant difference between the number of reviews ("text_reviews_count") between the group "best_1000" and the group "others_1000"
H1: The number of reviews ("text_reviews_count") in the "best_1000" group is significantly greater than the number of reviews in the "others_1000" group.
statistic, p_value = mannwhitneyu(best_1000["text_reviews_count"], others_1000["text_reviews_count"], alternative="greater")
print("Statistic:", statistic)
print("p-value:", p_value)
Statistic: 967271.0 p-value: 2.914974824961932e-287
In both tests the p-value is very small, so we reject the null hypothesis and accept the alternative: there is a difference between the two groups, and we can conclude that authors with more fans receive more reviews than authors with fewer fans.
For example, when a statistician needs to estimate a population parameter, one starts by drawing a representative sample of the population; but once the parameter estimate (the statistic) is obtained, how can one check whether it corresponds to the true, unknown parameter of the population?
Some assumptions must be made and so-called hypothesis tests must be run: hypothesis testing is a statistical procedure that allows us to ask a question based on sample information and reach a statistically significant decision.
The idea is that there always exists a situation that can be considered the default: the null hypothesis H0.
On the other hand, there is the alternative scenario that, if accepted, changes the status quo: the alternative hypothesis H1.
The goal is to assess whether the evidence from the sample provides enough support to reject the null hypothesis in favor of the alternative hypothesis.
The null hypothesis H0 is the one retained when the evidence is inconclusive, while the alternative hypothesis completely changes the scenario.
In order to choose between the hypotheses, two main types of error must be introduced: the type I error, i.e. rejecting H0 when it is actually true (its probability is the significance level $\alpha$), and the type II error, i.e. failing to reject H0 when it is actually false (its probability is $\beta$).
The type I error is considered the more serious of the two, and therefore one tries to keep it under control first.
Typically $\alpha$ takes the value 0.05, i.e. the probability of wrongly rejecting the null hypothesis is 5%. The level of $\alpha$ also determines the reliability of the test, because the confidence level is 1-$\alpha$ and thus is usually 95%.
The 1-$\beta$ quantity, on the other hand, is known as the power of the test.
Of course, the goal is to have as small errors as possible, but for the same sample size as alpha decreases the value of beta increases, so it is a trade-off between these two quantities.
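The meaning of $\alpha$ can be illustrated by simulation: when H0 is true, a test at $\alpha$ = 0.05 should wrongly reject about 5% of the time. A minimal sketch using a t-test on samples drawn from the same distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_sim, false_rejections = 0.05, 2000, 0

for _ in range(n_sim):
    # Both samples come from the same distribution, so H0 is true by construction
    a = rng.normal(0, 1, size=30)
    b = rng.normal(0, 1, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_rejections += 1

# The empirical type I error rate should be close to alpha
print(false_rejections / n_sim)
```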
At this point we have chosen the test hypotheses and know that there are two types of errors, we have also chosen the alpha significance level.
We then obtain the region of acceptance the region of rejection, calculate the test statistic and check whether its value falls within the region of acceptance or rejection.
Hypothesis tests can be bilateral (two-sided) or unilateral (one-sided).
Usually to tell whether you have a bilateral or unilateral test you need to look at the shape of the alternative hypothesis, if it contains symbols such as $>$ or $<$ it will be unilateral while with the symbol $\neq$ it is a bilateral test.
Not only do you use the critical values of the test statistics to see whether or not you accept the null hypothesis but you can also use the p-value. If the p-value is less than alpha you reject the null hypothesis and accept the alternative hypothesis.
Parametric statistical methods are all methods that are based on probability distributions such as the normal distribution or Student's t distribution.
In contrast, non-parametric methods are those that are not based on any probability distribution. In general, non-parametric tests should be preferred when the data do not follow a normal distribution, or when normality cannot be verified, for example because of a reduced sample size.
In terms of assumptions, non-parametric tests seem to be fine in any case: why, then, are they not used all the time?
Because parametric tests are more sensitive and accurate, and they require smaller samples than non-parametric tests to achieve the same significance level.
We can make a table to summarize the test.
| Purpose | Parametric test | Non Parametric test |
|---|---|---|
| Test on a parameter of location (mean or median) | t-test | Wilcoxon Mann Whitney test |
| Determination of linear relationship between two variables | Pearson correlation test | Spearman correlation test |
| Test the hypothesis that two populations that both follow normal distributions have the same variance | F-test | Fligner - Killeen test |
| Evaluate the differences between the averages of multiple groups | ANOVA | Kruskal - Wallis test |
| Determine whether the difference between a sample mean and a population mean is statistically significant | Z-test | Wilcoxon signed-rank test |
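As a small illustration of the table, the parametric t-test and its non-parametric counterpart can be run on the same made-up samples; with a clear location shift both reject H0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(10, 2, size=50)   # illustrative samples
group_b = rng.normal(13, 2, size=50)   # same spread, shifted location

_, p_ttest = stats.ttest_ind(group_a, group_b)
_, p_mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# With a 1.5-sigma shift and n=50 per group, both tests detect the difference
print(p_ttest < 0.05, p_mwu < 0.05)
```

On heavily skewed data or tiny samples, the two p-values can diverge, which is exactly when the non-parametric column of the table becomes the safer choice.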
Using PySpark to load the authors.json dataset and filter authors with at least 100 reviews.
# start time
start_time = time.time()
# loading dataset
df = spark.read.json('lighter_authors.json')
# filtering authors with at least 100 reviews
filtered_df = df.filter(df.text_reviews_count >= 100)
filtered_df.show()
# end time
end_time = time.time()
# print the execution time
print("Time taken with PySpark: {} ms".format((end_time - start_time)*1000))
[filtered_df.show() output: a 12-column table (about, average_rating, book_ids, fans_count, gender, id, image_url, name, ratings_count, text_reviews_count, work_ids, works_count); the first rows include Douglas Adams, Bill Bryson, Jude Fisher, James Hamilton-Paterson, Mark Watson, Edith Wharton, ...; only showing top 20 rows]
Time taken with PySpark: 6365.464210510254 ms
Using Pandas to load the authors.json dataset and filter authors with at least 100 reviews.
# start time
start_time = time.time()
# loading dataset
df = pd.read_json('lighter_authors.json' , lines=True)
# filtering authors with at least 100 reviews
filtered_df = df[df['text_reviews_count'] >= 100]
print(filtered_df)
# end time
end_time = time.time()
# print the execution time
print("Time taken with pandas: {} ms".format((end_time - start_time)*1000))
ratings_count average_rating text_reviews_count \
0 2862064 4.19 62681
1 1417316 4.02 84176
2 56159 4.53 352
3 3302 3.79 480
4 7979 3.60 772
... ... ... ...
351702 1033 4.13 203
351711 1280 4.14 101
351712 1604 4.13 191
351724 216 4.16 123
351736 2972 3.28 639
work_ids \
0 [3078186, 135328, 1877624, 74123, 3078120, 104...
1 [613469, 2305997, 940892, 2611786, 7800569, 31...
2 [17150, 808427, 20487307, 90550, 25460625, 171...
3 [4417, 14300808, 14780, 3796968, 44703121, 103...
4 [13330815, 19109351, 42306244, 72694240, 26291...
... ...
351702 [57108071, 58027060, 57920856, 66021782, 56221...
351711 [60374437, 56222956, 68555697, 60374436]
351712 [56223529]
351724 [56227117, 60792479, 66649997]
351736 [56231202, 69212002, 69360180, 71655514, 73255...
book_ids works_count \
0 [386162, 13, 8695, 8694, 6091075, 365, 569429,... 106
1 [9791, 21, 28, 24, 7507825, 27, 10538, 25, 26,... 75
2 [349254, 15222, 14833682, 15221, 18126815, 152... 14
3 [40, 9416484, 12482, 3753106, 26889789, 104764... 45
4 [8466327, 15739968, 22756778, 51026133, 260451... 61
... ... ...
351702 [35891069, 36345855, 36265767, 42366213, 35212... 10
351711 [38800220, 34950491, 44085252, 38800219] 4
351712 [34950854] 1
351724 [34952863, 39076682, 42866213] 3
351736 [34957071, 44587970, 44709275, 52416898, 52858... 8
id name gender \
0 4 Douglas Adams male
1 7 Bill Bryson male
2 10 Jude Fisher female
3 12 James Hamilton-Paterson male
4 14 Mark Watson male
... ... ... ...
351702 16771296 Holly Castillo female
351711 16771622 Noah Barnett
351712 16771714 Danielle Town female
351724 16773270 Kathleen Valenti female
351736 16775684 Emma Glass female
image_url \
0 https://images.gr-assets.com/authors/159137433...
1 https://images.gr-assets.com/authors/157859752...
2 https://images.gr-assets.com/authors/141145711...
3 https://images.gr-assets.com/authors/127051738...
4 https://images.gr-assets.com/authors/133175379...
... ...
351702 https://images.gr-assets.com/authors/152322633...
351711 https://s.gr-assets.com/assets/nophoto/user/u_...
351712 https://images.gr-assets.com/authors/151075298...
351724 https://images.gr-assets.com/authors/154698578...
351736 https://images.gr-assets.com/authors/152062484...
about fans_count
0 Douglas Noël Adams was an English author, comi... 19826
1 William McGuire "Bill" Bryson, OBE, FRS was bo... 16144
2 Jude Fisher is the pseudonym for <a href="http... 60
3 James Hamilton-Paterson's work has been transl... 72
4 Mark Andrew Watson (born 13 February 1980) is ... 179
... ... ...
351702 Holly grew up spending many lazy summer days r... 22
351711 Noah Barnett was born in Washington and grew u... 86
351712 A corporate startup attorney, DANIELLE TOWN ea... 64
351724 Kathleen Valenti has been writing for more tha... 223
351736 Emma Glass was born in Swansea. <br />She stud... 70
[64565 rows x 12 columns]
Time taken with pandas: 9315.683841705322 ms
In this run PySpark was faster than pandas (about 6.4 s vs 9.3 s), and the difference becomes more evident as the dataset grows. This is because PySpark is a distributed computing framework, while pandas is a single-node library.
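The ad-hoc start/end timing used above can be wrapped in a small reusable helper; a sketch (the `timer` name is ours) using `time.perf_counter`, which is better suited to interval measurement than `time.time`:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # perf_counter is monotonic and high-resolution, unlike time.time
    start = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label}: {elapsed_ms:.1f} ms")

# Usage: any block of code can be timed the same way
with timer("toy workload"):
    total = sum(i * i for i in range(100_000))
```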
# Load the data into pyspark dataframes
df_lighter_books = spark.read.json("lighter_books.json")
df_lighter_authors = spark.read.json("lighter_authors.json")
# Perform the join operation
joined_df = df_lighter_books.join(df_lighter_authors, df_lighter_books.author_id == df_lighter_authors.id, 'left_outer')
# Count the number of books without a matching author
books_without_author = joined_df.filter(df_lighter_authors.id.isNull()).count()
# Print the result
print("Number of books without a matching author:", books_without_author)
Number of books without a matching author: 0
The number of books that don't have a match for the author is 0.
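The same orphan check can also be expressed in pandas with `merge(..., indicator=True)`; a self-contained sketch on made-up miniature frames:

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
books = pd.DataFrame({"title": ["A", "B", "C"], "author_id": [1, 2, 99]})
authors = pd.DataFrame({"id": [1, 2, 3], "name": ["X", "Y", "Z"]})

merged = books.merge(authors, left_on="author_id", right_on="id",
                     how="left", indicator=True)
# Rows marked "left_only" are books without a matching author
orphans = (merged["_merge"] == "left_only").sum()
print(orphans)  # 1: the book with author_id 99 has no author
```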
# select "title" and "description" columns from df_lighter_books
df_lighter_books = df_lighter_books.select("title", "description")
# perform data cleaning and preprocessing on the "description" column
df_lighter_books = df_lighter_books.filter(trim(col('description')) != '') # removes rows with empty description
df_lighter_books = df_lighter_books.dropna(subset=['description']) # removes rows with null values in description
df_lighter_books = df_lighter_books.withColumn('description', regexp_replace('description', '<.*?>', '')) # remove html tags
df_lighter_books = df_lighter_books.withColumn('description', regexp_replace('description', r'[^\sa-zA-Z0-9]', '')) # remove special characters (raw string avoids escape warnings)
df_lighter_books = df_lighter_books.withColumn('description', regexp_replace('description', r'\s+', ' ')) # collapse extra spaces
df_lighter_books = df_lighter_books.withColumn('description', regexp_replace('description', r'^\s+|\s+$', '')) # remove leading and trailing spaces
df_lighter_books = df_lighter_books.withColumn('description', lower(col('description'))) # convert to lowercase
# perform data cleaning and preprocessing on the "title" column
df_lighter_books = df_lighter_books.filter(trim(col('title')) != '') # removes rows with empty title
df_lighter_books = df_lighter_books.dropna(subset=['title']) # removes rows with null values in title
df_lighter_books = df_lighter_books.withColumn('title', regexp_replace('title', '<.*?>', '')) # remove html tags
df_lighter_books = df_lighter_books.withColumn('title', regexp_replace('title', r'[^\sa-zA-Z0-9]', '')) # remove special characters (raw string avoids escape warnings)
df_lighter_books = df_lighter_books.withColumn('title', regexp_replace('title', r'\s+', ' ')) # collapse extra spaces
df_lighter_books = df_lighter_books.withColumn('title', regexp_replace('title', r'^\s+|\s+$', '')) # remove leading and trailing spaces
df_lighter_books = df_lighter_books.withColumn('title', lower(col('title'))) # convert to lowercase
# summary statistics of the dataframe
df_lighter_books.describe().show()
+-------+--------+--------------------+
|summary|   title|         description|
+-------+--------+--------------------+
|  count| 5462026|             5462026|
|   mean|     NaN|9.077326391804685...|
| stddev|     NaN|            Infinity|
|    min|        |                    |
|    max|zzzzzzzz|zzzziiiinnngggg s...|
+-------+--------+--------------------+
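The cleaning chain applied above can be illustrated with plain Python `re` on a single made-up description string, applying the same substitutions in order:

```python
import re

raw = "  <b>A Great  Book!</b> & 2nd edition <br/> "  # made-up example

text = re.sub(r"<.*?>", "", raw)            # remove html tags
text = re.sub(r"[^\sa-zA-Z0-9]", "", text)  # remove special characters
text = re.sub(r"\s+", " ", text)            # collapse extra spaces
text = text.strip().lower()                 # trim and lowercase

print(text)  # "a great book 2nd edition"
```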
We aim to group authors based on genres derived from the "description" field of our DataFrame, df_lighter_books. To accomplish this, we use the PySpark machine learning library to process the text and apply K-means clustering, an unsupervised learning algorithm.
We implemented the following steps in a PySpark pipeline:
Tokenization: We used PySpark's Tokenizer to divide the text into individual words, resulting in a column of arrays containing tokenized words.
Stopwords Removal: We used StopWordsRemover to remove common words that typically don't carry much meaning (like "and", "the", "a").
Count Vectorization: We converted our collections of text documents into vectors of token counts using CountVectorizer.
TF-IDF Transformation: We used IDF to calculate the Inverse Document Frequency, thereby weighing the importance of words across the documents.
K-means Clustering: Finally, we applied the K-means algorithm, which grouped authors into clusters (or genres) based on the TF-IDF features of their descriptions.
We fit our pipeline model on df_lighter_books and then transformed the DataFrame to get genre predictions. We set the number of clusters (K) to 5, although this parameter could be adjusted to refine the grouping.
After fitting the model, we extracted the cluster centers and identified the top 10 words associated with each cluster, which helped us interpret the characteristics of each cluster.
# Tokenization
tokenizer = Tokenizer(inputCol="description", outputCol="words")
# Remove stop words and special characters
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
# Count Vectorization
countVectorizer = CountVectorizer(inputCol="filtered_words", outputCol="raw_features")
# TF-IDF
idf = IDF(inputCol="raw_features", outputCol="features")
# K-means clustering
kmeans = KMeans(k=5, seed=55, featuresCol="features", predictionCol="genre")
# Create a pipeline
pipeline = Pipeline(stages=[tokenizer, remover, countVectorizer, idf, kmeans])
# Fit the pipeline model
model = pipeline.fit(df_lighter_books)
# Transform the DataFrame to get genre predictions
df_with_predictions_books = model.transform(df_lighter_books)
# Show the cluster centers
print("Cluster Centers:")
centers = model.stages[-1].clusterCenters()
for center in centers:
print(center)
Cluster Centers:
[5.69728917e-01 1.96845942e-01 2.50925284e-01 ... 2.72271368e-05 2.37798960e-05 7.35027829e-05]
[1.45876464e-01 4.53179969e-02 6.52285010e-03 ... 0.00000000e+00 0.00000000e+00 1.34158252e-04]
[2.67832293e+01 1.50080913e+01 9.37211361e-03 ... 5.17010205e-05 0.00000000e+00 0.00000000e+00]
[2.27606590e-01 9.79674109e+00 5.05809907e-03 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[8.37373816e-02 4.00035056e-02 1.03214131e+00 ... 2.71842266e-04 2.85623082e-04 1.83927173e-04]
# Show books with their corresponding genre predictions
df_with_predictions_books.select("title", "genre").show()
+--------------------+-----+
|               title|genre|
+--------------------+-----+
|harry potter and ...|    0|
|harry potter and ...|    4|
|harry potter and ...|    0|
|harry potter and ...|    0|
|harry potter and ...|    0|
|the harry potter ...|    0|
|harry potter boxe...|    0|
|harry potter coll...|    0|
|the hitchhikers g...|    0|
|the ultimate hitc...|    4|
|the ultimate hitc...|    4|
|the hitchhikers g...|    0|
|the hitchhikers g...|    4|
|the hitchhikers g...|    0|
|the ultimate hitc...|    4|
|the hitchhikers g...|    0|
|a short history o...|    4|
|bill brysons afri...|    4|
|brysons dictionar...|    0|
|in a sunburned co...|    4|
+--------------------+-----+
only showing top 20 rows
# Extract the vocabulary from CountVectorizer
vocabulary = model.stages[2].vocabulary
# Get the top N words from each cluster
N = 10 # Number of words per cluster
for i, center in enumerate(centers):
topN = center.argsort()[-N:][::-1]
print(f"Cluster {i}:")
for idx in topN:
print(vocabulary[idx])
Cluster 0: de, book, work, en, e, one, new, o, life, world
Cluster 1: und, der, die, sie, von, zu, ist, den, das, sich
Cluster 2: de, la, que, y, en, el, et, un, le, les
Cluster 3: di, e, che, il, la, un, per, una, della, del
Cluster 4: life, one, new, book, world, love, time, first, story, years
df_with_predictions_books.describe().show() # summary statistics of the dataframe
+-------+--------+--------------------+------------------+
|summary|   title|         description|             genre|
+-------+--------+--------------------+------------------+
|  count| 5462026|             5462026|           5462026|
|   mean|     NaN|9.077326391804685...|1.3543538606370604|
| stddev|     NaN|            Infinity|1.8233677125034415|
|    min|        |                    |                 0|
|    max|zzzzzzzz|zzzziiiinnngggg s...|                 4|
+-------+--------+--------------------+------------------+
Using PySpark's machine learning pipeline, we were able to group authors into genres based on their descriptions. This approach could be further refined by tuning the parameters of the K-means model or experimenting with other text processing techniques and clustering algorithms.
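One way to tune K is to compare silhouette scores across candidate values; a sketch with scikit-learn on a tiny made-up corpus (PySpark offers an analogous metric via ClusteringEvaluator):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [  # tiny, made-up corpus: two fantasy-like and two thriller-like blurbs
    "wizard dragon magic kingdom quest",
    "magic spell wizard castle dragon",
    "detective murder crime investigation clue",
    "crime thriller detective mystery suspect",
]
X = TfidfVectorizer().fit_transform(docs)

scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the corpus has two topics, so k=2 should score best
```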
# perform data cleaning and preprocessing on the "about" column of df_lighter_authors
df_lighter_authors = df_lighter_authors.filter(trim(col('about')) != '') # removes rows with empty about
df_lighter_authors = df_lighter_authors.dropna(subset=['about']) # removes rows with null values in about
df_lighter_authors = df_lighter_authors.withColumn('about', regexp_replace('about', '<.*?>', '')) # remove html tags
df_lighter_authors = df_lighter_authors.withColumn('about', regexp_replace('about', r'[^\sa-zA-Z0-9]', '')) # remove special characters (raw string avoids escape warnings)
df_lighter_authors = df_lighter_authors.withColumn('about', regexp_replace('about', r'\s+', ' ')) # collapse extra spaces
df_lighter_authors = df_lighter_authors.withColumn('about', regexp_replace('about', r'^\s+|\s+$', '')) # remove leading and trailing spaces
df_lighter_authors = df_lighter_authors.withColumn('about', lower(col('about'))) # convert to lowercase
# summary statistics of the dataframe
df_lighter_authors.describe().show()
+-------+--------------------+------------------+------------------+---------+------------------+--------------------+-------------------+-----------------+------------------+------------------+
|summary|               about|    average_rating|        fans_count|   gender|                id|           image_url|               name|    ratings_count|text_reviews_count|       works_count|
+-------+--------------------+------------------+------------------+---------+------------------+--------------------+-------------------+-----------------+------------------+------------------+
|  count|              351767|            351767|            351767|   351767|            351767|              351767|             351767|           351767|            351767|            351767|
|   mean|3.465256873439093...| 3.651194313281242|111.61573143586521|     NULL|7751861.1911975825|                NULL|           Infinity|4770.586308550831| 330.9932426862099|25.937137366495435|
| stddev|2.639288505618830...|1.2114816423862007| 2661.018138865955|     NULL| 6578409.001586667|                NULL|                NaN|97693.95378788888|3857.5391911629276|3066.0828516236857|
|    min|                    |             -31.0|               -33|         |                 4|https://images.gr...|"Big" John McCarthy|              -41|                 0|                 0|
|    max|zz packer born ja...|               5.0|            766035|wizardess|          21248016|https://s.gr-asse...|              Toshl|         27003752|            608956|           1775176|
+-------+--------------------+------------------+------------------+---------+------------------+--------------------+-------------------+-----------------+------------------+------------------+
We demonstrate a methodology for grouping authors into genres based on the "about" field, using the PySpark machine learning library. We implement a pipeline consisting of text processing, TF-IDF transformation, and K-means clustering.
We then proceed with the following steps in our pipeline:
Tokenization: We divide the text into individual words using PySpark's Tokenizer.
Stopwords Removal: We remove common words that typically don't carry much meaning using StopWordsRemover.
Count Vectorization: We convert our collection of text documents into vectors of token counts using CountVectorizer.
TF-IDF Transformation: We calculate the Inverse Document Frequency to weigh the importance of words across the documents using IDF.
K-means Clustering: We group authors into clusters (or genres) based on the TF-IDF features of their "about" text using K-means.
After fitting the model, we extract the cluster centers, which represent the centroid of each cluster in the feature space. We also identify the top 10 words associated with each cluster, which give us an insight into the characteristics of each cluster. The authors, along with their corresponding genre predictions, are then displayed.
# Tokenization
tokenizer = Tokenizer(inputCol="about", outputCol="words")
# Remove stop words and special characters
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
# Count Vectorization
countVectorizer = CountVectorizer(inputCol="filtered_words", outputCol="raw_features")
# TF-IDF
idf = IDF(inputCol="raw_features", outputCol="features")
# K-means clustering
kmeans = KMeans(k=5, seed=55, featuresCol="features", predictionCol="genre")
# Create a pipeline
pipeline = Pipeline(stages=[tokenizer, remover, countVectorizer, idf, kmeans])
# Fit the pipeline model
model = pipeline.fit(df_lighter_authors)
# Transform the DataFrame to get genre predictions
df_with_predictions_authors = model.transform(df_lighter_authors)
# Show the cluster centers
print("Cluster Centers:")
centers = model.stages[-1].clusterCenters()
for center in centers:
    print(center)
Cluster Centers:
[0.28937787 0.35162207 0.2722106  ... 0.         0.         0.        ]
[6.87610163e+01 3.51104812e-02 4.41389590e-02 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[2.69259078e+01 2.26954022e-02 3.72203737e-02 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[3.76836256e+01 3.71893913e-02 3.31042193e-02 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[1.92529479e-01 1.35175495e+00 1.68329403e+00 ... 3.67301745e-04 3.67301745e-04 3.67301745e-04]
# Show authors with their corresponding genre predictions
df_with_predictions_authors.select("name", "genre").show()
+--------------------+-----+
|                name|genre|
+--------------------+-----+
|       Douglas Adams|    4|
|         Bill Bryson|    4|
|         Jude Fisher|    0|
|James Hamilton-Pa...|    0|
|         Mark Watson|    0|
|       Edith Wharton|    4|
|       Luther Butler|    4|
|        Gary Paulsen|    0|
|           Dale Peck|    0|
|       Angela Knight|    4|
|       Delia Sherman|    0|
|Patricia A. McKillip|    4|
|      Anne McCaffrey|    4|
|Zilpha Keatley Sn...|    0|
|        Kate Horsley|    4|
|   Elaine Cunningham|    0|
|       Philippa Carr|    0|
|     Edward P. Jones|    0|
|        Satyajit Das|    0|
|         Mark Smylie|    4|
+--------------------+-----+
only showing top 20 rows
# Extract the vocabulary from CountVectorizer
vocabulary = model.stages[2].vocabulary
# Get the top N words from each cluster
N = 10 # Number of words per cluster
for i, center in enumerate(centers):
    topN = center.argsort()[-N:][::-1]
    print(f"Cluster {i}:")
    for idx in topN:
        print(vocabulary[idx])
Cluster 0: author writing university books de lives book new one also
Cluster 1: de en y la el que del su con por
Cluster 2: de e en la y em da el o que
Cluster 3: de et la le en des il les du dans
Cluster 4: book writing first years new university also books life published
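The `argsort` idiom used above to pick the top N words can be illustrated on a toy array (illustrative values only, not an actual cluster center):

```python
import numpy as np

# argsort returns indices that sort the array ascending, so the last N
# entries point at the N largest weights; [::-1] flips them descending.
center = np.array([0.1, 0.9, 0.4, 0.7])
top2 = center.argsort()[-2:][::-1]
print(top2)  # -> [1 3]: indices of the two largest weights
```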
df_with_predictions_authors.describe().show() # summary statistics of the dataframe
+-------+--------------------+------------------+------------------+---------+------------------+--------------------+-------------------+-----------------+------------------+------------------+------------------+
|summary|               about|    average_rating|        fans_count|   gender|                id|           image_url|               name|    ratings_count|text_reviews_count|       works_count|             genre|
+-------+--------------------+------------------+------------------+---------+------------------+--------------------+-------------------+-----------------+------------------+------------------+------------------+
|  count|              351767|            351767|            351767|   351767|            351767|              351767|             351767|           351767|            351767|            351767|            351767|
|   mean|3.465256873439093...| 3.651194313281242|111.61573143586521|     NULL|7751861.1911975825|                NULL|           Infinity|4770.586308550831| 330.9932426862099|25.937137366495435| 0.781585538154517|
| stddev|2.639288505618830...|1.2114816423862007| 2661.018138865955|     NULL| 6578409.001586667|                NULL|                NaN|97693.95378788888|3857.5391911629276|3066.0828516236857|1.5495874489987826|
|    min|                    |             -31.0|               -33|         |                 4|https://images.gr...|"Big" John McCarthy|              -41|                 0|                 0|                 0|
|    max|zz packer born ja...|               5.0|            766035|wizardess|          21248016|https://s.gr-asse...|              Toshl|         27003752|            608956|           1775176|                 4|
+-------+--------------------+------------------+------------------+---------+------------------+--------------------+-------------------+-----------------+------------------+------------------+------------------+
Grouping authors into genres based on their "about" text provides valuable insights into their writing styles and thematic interests. Notably, the top words suggest that clusters 1-3 mainly separate non-English biographies (dominated by Spanish, Portuguese, and French function words), since StopWordsRemover's default stopword list is English-only, while clusters 0 and 4 capture English-language bios. The K-means algorithm, combined with PySpark's text processing, proves useful for this task. Future refinements could include tuning the number of clusters, removing stopwords for other languages, experimenting with different clustering algorithms, or applying additional text-processing techniques.
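One such refinement, choosing the number of clusters by silhouette score, could be sketched with scikit-learn on a small sample. This is a minimal sketch under the assumption that a representative sample fits in memory; the texts below are illustrative placeholders, not rows from our dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-ins for the "about" field (mixed languages on purpose).
texts = [
    "novelist and essayist living in new york",
    "award winning science fiction author",
    "ecrivain francais, auteur de romans",
    "auteur de poesie et de theatre",
    "escritora de novelas y cuentos",
    "autor de libros de historia y ensayos",
]

X = TfidfVectorizer().fit_transform(texts)

# Fit K-means for several candidate k values and score each clustering.
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=55).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```

The same idea carries over to the PySpark pipeline via `pyspark.ml.evaluation.ClusteringEvaluator`, re-fitting the pipeline for each candidate `k`.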
At a high level, the degree of match between the two procedures depends on several factors: the quality and specificity of the description and about fields, the granularity of the genres (i.e., the number of clusters), and the relevance of these fields to the actual genres. Since these fields contain considerable noise and irrelevant information, and the genres are highly granular, the correspondence between the two procedures is likely lower than expected.